
Trustworthiness of open government data

An analysis of the requirements for open government data from the perspective of authoritativeness and sustainable accessibility

10th of July 2020

Jan Koers

Student id 11933984


Table of contents

1. Introduction
2. Trust as a basic requirement for our information society
2.1 Introduction
2.2 What defines trust and how is authoritativeness of information related
2.3 The difference between data and records from the perspective of trust
2.4 Recordkeeping principles that support trust
2.5 Conclusions
3. Assessing current open government data on trustworthiness
3.1 Why open data
3.2 Research questions and expected outcome
4. Recordkeeping principles supporting trustworthiness of open data
4.1 Key components of trust and supporting recordkeeping principles
4.2 Focus group about valorisation of key components of trust
4.3 Form to assess the open data
5. Incorporation of recordkeeping principles in Dutch open data
5.1 Introduction
5.2 The Dutch base registry of buildings and addresses (BAG)
5.3 Air quality measurements by Dutch institutions
5.4 Crowd sourced air quality data
5.5 Analysis of the research data
5.6 Assessment of the assessment framework
6. Recommendations to promote trust of open data environments
6.1 Translation of the research findings to recommendations
6.2 Standardized and inseparable minimum metadata set for individual information objects
6.2 Source information system as archiving and publication platform
7. Conclusions
Bibliography


List of figures

Figure 5.1 Formalisation continuum of data creation environments
Figure 5.2 Visualisation of the building and address information using the BAG viewer
Figure 5.3 Infographic on the re-use of BAG data
Figure 5.4 Visualization of current measurements of fine dust (PM10) on June 6th 2020, 14:55 hours
Figure 5.5 Report of historical measurements of fine dust (PM10) on August 6th 2018
Figure 5.6 Data acquisition, processing and publication of sniffer bike data
Figure 5.7 Visualisation of sniffer bike open data with a specific web application
Figure 6.1 Possible information architecture with integral recordkeeping functionality

List of tables

Table 3.1 Research questions and proposed research methodologies
Table 4.1 Relation of trust components (3 perspectives), trust type and recordkeeping principles
Table 4.2 Assessment form for assessing recordkeeping principles in open data sets
Table 5.1 Assessment form for BAG data
Table 5.2 Assessment form for air quality measurements by Dutch institutions
Table 5.3 Assessment form for crowd sourced air quality data
Table 5.4 Classification of presence of recordkeeping principles in 3 assessed open data sets


1. Introduction

For thousands of years, information has been preserved by passing on stories and by writing them down on physical media such as paper. In recent decades, digitization started with the conversion of written or typed documents into digital representations by scanning, with no major change in the way the information was essentially used.

But with technology developing rapidly, a real digital transformation is now taking place. Formal and informal communication and interaction are becoming increasingly digital, as can be concluded from the wide range of official digital government programs and the massive use of social media platforms.

These technological changes induce societal changes: from changing democracy through platforms for e-democracy on the one hand to the informational isolation of individuals and groups in information bubbles on the other. There is ongoing debate on how to manage these technological and societal changes: promoting transparency and openness of government (such as the future Dutch legislation “Wet Open Overheid”), protecting against the abuse of personal information (such as the European General Data Protection Regulation), and dealing with the growing use of data and black-box algorithms in automated decision-making processes without knowing their exact quality and functioning.

All these technological and societal developments imply challenges for information management: how can we deal with the fluidity, temporality and changing context of information? How can we support transparency and a sound interpretation of data and algorithms? How can we support the trustworthiness of digital data?

This research investigates trust in data as a basic requirement for the current trend of datafication in our society. It elaborates on the difference between data and records and on the recordkeeping principles that can support trust in data. Subsequently, several Dutch open data sets are assessed on their level of trustworthiness. Finally, recommendations are given to integrate recordkeeping principles in data governance.


2. Trust as a basic requirement for our information society

2.1 Introduction

Current developments in our information society stress the importance of the concept of trust in information. In recent research our time has been referred to as an “era of post-truth and disinformation” (Duranti, 2018). A complete research program in archival science focusses on this aspect (the Interpares Trust program1). Upward et al. (2018) elaborate on “authoritative information resource management” as crucial for our networked information society.

This raises the question of what actually defines trust and how authoritative information resource management is related to trust. Is there a difference between the concept of data and the concept of records from the perspective of trust? What (recordkeeping) principles are key to support the trustworthiness of data?

2.2 What defines trust and how is authoritativeness of information related

Webster’s dictionary defines trust as “Assured resting of the mind on the integrity, veracity, justice, friendship, or other sound principle, of another person”. Trust is therefore related to the perception of persons on the one hand and to sound principles of the object to be trusted on the other. In this case the object is information, for which integrity and veracity would be the applicable principles.

In archival science, Donaldson & Conway (2015) studied user conceptions of trustworthiness for digital archival documents. They concluded that the formulation of trustworthiness as a concept in Kelton, Fleischmann, and Wallace’s (2008) Integrated Model of Trust in Information can be questioned. Besides the aspects of accuracy, believability, coverage, currency, objectivity, stability and validity in the model, Donaldson & Conway revealed that aspects such as authenticity, inaccurate/trustworthy, first-hand/primary, legibility/readability and form are perceived as relevant as well.

Wang et al. (2011), in their study on trust in machine-created data, concluded that trust is related to the reputation of the creator and/or publisher of data. The reputation of the creator and/or publisher is related to the supposed quality of the data created and/or published. The quality indication is based on assessment by experts, reputation rules of computational systems, or the re-users of the information. Ceolin et al. (2016) combine user reputation and provenance analysis, looking at provenance for the estimation of trust in an automated way. They designed a series of algorithms to extract relevant provenance features, generate stereotypes of user behaviour from the provenance features, and estimate the reputation of both stereotypes and users. Based on that they estimate the trustworthiness of “artefacts”.

Upward et al. (2018) use, in their publication “Recordkeeping Informatics for a Networked Age”, the term authoritative information resource management and relate this to Giddens’ social theory. They elaborate that our societal values determine whether authoritative information is important for the functioning of society and what information will be considered authoritative. Giddens’ theory stresses that social structures are based on relations and shows how social structures survive in space and time. As a parallel, the authors describe that records are structured to reflect the relations in society and are therefore based on relations as well. Consequently, the survival of information in space and time is related to how social structures survive. They conclude that in current information resource management the authoritative approach has been lost from view, caused by cultures that mainly value the here and now.

1 https://www.interparestrust.org

Gillian (2018) concludes that there are many socio-cognitive factors at play and that a reliable source alone does not consistently determine trustworthiness for users. Yoon (2014) concludes, on trust in data in repositories, that “trust in data itself plays a distinctive and important role for users to reuse data, which may or may not be related to the trust in repositories”. He concludes that users’ trust in data is another important area to be investigated further, creating a bridge to the data governance domain.

Based on the literature related to trust in information, two types of trustworthiness will be taken into account in this research: the trustworthiness of the source of the information, which will be referred to as social based trust, and the trustworthiness of the information itself, which will be referred to as content based trust.

2.3 The difference between data and records from the perspective of trust

In the previous paragraphs, three different terms have been used without defining them yet:

data, record and information. This paragraph introduces definitions of these terms within the

context of this research and looks at the difference between data and records from the

perspective of trust.

The distinction between the terms data and record is blurred. Classical records are supposed to reflect subjective, context-dependent and human-induced information, which is subject to interpretation and is associated with authority and actions. Data (and the devices and algorithms that create and/or process them) are considered objective, accurate, true and therefore neutral, and can thus be objectively analysed to derive patterns and/or knowledge and to project future behaviour or developments.

But data are not neutral at all: the design of sensors, the modelling of data models, the classification of information, the design of algorithms, the results of analysis, all are human activities that start from an interpreted and/or subjective theory or design that influences the way the data can be used and interpreted. Additionally, current technologies enable us to convert classical records to data by extracting data from, for instance, textual or audiovisual information. The other way around, classical records can be generated from data.


Borglund et al. (2014) state that the terms data, records and information can be used interchangeably but that their use might carry different connotations. They argue that the term record is often used when the legal context or meaning is important, which coincides with the ISO 15489 definition of records as “Information created, received, and maintained as evidence and information by an organization or person, in pursuance of legal obligations or in transaction of business.” The term information is used when there is a more customer-oriented focus, with usage, accessibility and benefits at its core. The term data mostly refers to easily processable (tabular or structured) data, sometimes referred to as “raw material”. The growing generation and use of this type of data explains the rising use of the term data. Borglund et al. conclude that different stakeholders have different perspectives and stress different aspects, which is why different terminology is being used.

This research is related to the creation and use of government information. This information is almost always directly or indirectly part of a process that has legal or transactional characteristics. Still, different terms are being used, such as data, records and information. This research focusses on the role of recordkeeping principles, identified in archival science to support the trustworthiness of information, in analysing trust aspects of open government data. Therefore the basic terminology has been based on the descriptions of terminology used in the reference project Preservation as a Service for Trust (PaaST) from the Interpares Trust Program (2018)2. This reference document has at its core the definition of an “intellectual entity”:

• Intellectual entity: artefacts that are intended to communicate information; this encompasses human-readable entities like texts and photographs and machine-readable entities like databases and software
• Data: objects that are the ingredients of intellectual entities; this includes objects that are directly or indirectly created by humans, such as data that capture details of human interactions with social media or online systems, data generated by environmental sensors, and the outputs of artificial intelligence systems
• Record: a type of intellectual entity that was made or received in the course of a practical activity as an instrument or a by-product of such activity, and set aside for action or reference
• Information: the communication result of intellectual artefacts; this encompasses for instance the visualisation of environmental data on a map to understand air pollution, or a building permit used for starting a construction process

Because all data are made or received in the course of a practical activity, the key characteristic of a record is that it is “set aside” for action or reference, and therefore can serve as evidence.

2 https://interparestrust.org/terminology/term/Preservation as a Service for Trust (PaaST)


Within this research, “set aside” means that an intellectual entity has been defined beforehand and is created for reuse (within new actions or as a reference in new actions) and therefore should have the characteristics that make this possible: it should be trustworthy and it should be accessible, readable and usable.
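As a toy illustration of this terminology (not part of PaaST itself), the following sketch models a record as an intellectual entity that carries, in addition to its data, an indication of why it was set aside and from which activity it stems. The class and field names are assumptions made for this example.

```python
from dataclasses import dataclass, field

@dataclass
class IntellectualEntity:
    """An artefact intended to communicate information."""
    name: str
    data: dict = field(default_factory=dict)   # the objects that are its ingredients

@dataclass
class Record(IntellectualEntity):
    """An intellectual entity that has been set aside for action or reference."""
    set_aside_for: str = "reference"            # "action" or "reference"
    provenance: str = ""                        # the practical activity it stems from

# A loose sensor reading versus a record that can serve as evidence.
reading = IntellectualEntity("PM10 reading", {"value_ug_m3": 21.4})
permit = Record("building permit 2020-0123",
                {"address": "Examplestreet 1"},
                set_aside_for="action",
                provenance="municipal permit process")
```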

The characteristics that underpin trustworthiness can be achieved and/or supported by the use of recordkeeping principles in the design, creation and publishing of records. Which recordkeeping principles can support this trust is elaborated in the next paragraph.

2.4 Recordkeeping principles that support trust

As mentioned in paragraph 2.2, Upward et al. (2018) describe the challenges we face nowadays in the field of what they call authoritative information resource management. The authors, based on the social theory of Giddens, stress that fundamentally the (information) culture of a social structure will define the importance of recordkeeping and recordkeeping principles. Our societal values determine whether authoritative information is important for the functioning of society and what information will be trusted or considered authoritative.

The framework of the authors is directly related to actor-network theory. This theory serves as a basis to define and implement the relations between agents (users / roles by means of applications), records (intellectual entities set aside for action or reference) and actions (business processes). Within current technological and networked information environments, mostly based on data processing platforms, direct recording and archiving of information is already, or will become, the standard. The challenge is to guarantee the authoritativeness of the information.

The authors distinguish two recordkeeping building blocks to support authoritative information resource management in the current digital world:

• the record continuum model: in the digital age, technology enables and supports the (re)use of records; this implies a focus on the creation, capture and maintenance of records in a way that guarantees sustainable accessibility for reuse; this can be translated to the availability of repositories that guarantee access for (re)use, guarantee the integrity of the information, and maintain the readability and usability of information for future use by means of preservation technologies

• recordkeeping metadata: recordkeeping metadata can support the accountability of an organization and support the traceability, authenticity and sustainability of records; it can establish what can be done with records (taking into account the objective of use and aspects of privacy and confidentiality of records) and how long to retain records; this can be translated to the practice of creating, updating and using relevant metadata in every process that manages records; in this way metadata serve “to tell the story of information” and form the basis for the (re)construction of (trans)actions; with the relevant metadata, records can be used as evidence

These building blocks will be the basis for this research on trustworthiness of open

government data.
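As an illustration of the second building block, the sketch below shows what a minimal, inseparable recordkeeping metadata record for an individual information object might look like. The field names are assumptions of this sketch and do not follow any specific standard.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional

@dataclass
class RecordkeepingMetadata:
    """Minimal metadata kept inseparably with an individual information object."""
    identifier: str                     # persistent, unique identification
    creator: str                        # actor responsible for creation (provenance)
    business_process: str               # procedural context in which it was created
    created: datetime                   # moment of creation / "setting aside"
    file_format: str                    # technical format, needed for preservation
    retention: Optional[str] = None     # how long the record must be kept
    access_restrictions: Optional[str] = None   # privacy or confidentiality constraints
    events: List[str] = field(default_factory=list)  # actions performed on the record

    def log_event(self, description: str) -> None:
        """Record an action so the metadata keeps 'telling the story' of the record."""
        self.events.append(f"{datetime.now().isoformat()} {description}")

metadata = RecordkeepingMetadata(
    identifier="nl-example-0001",
    creator="Municipality X",
    business_process="building permit handling",
    created=datetime(2020, 6, 1),
    file_format="application/xml",
)
metadata.log_event("published as open data")
```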


2.5 Conclusions

For the functioning of a democratic society it is essential that there is trust between its members, be they government bodies, citizens, firms or educational institutions. Trust is supported by the creation, distribution and use of information that is considered trustworthy. Trustworthy information is supported by authoritative information resource management. Aspects that support trust are the trustworthiness of the provenance of the information (social based trust) and the trustworthiness of the information itself (content based trust). The trustworthiness of the information is determined by its users and can be supported by the use of metadata that describe who created the data, where and how, in what context, and what actions have been performed on it. When data and relevant metadata are accessible, the data can be verified and assessed on accuracy, reliability and authenticity. This supports content based trust and consequently social based trust.

These concepts of trust in information will be further investigated in this research. Do the identified building blocks support the trustworthiness of information and, if so, how do they support it? The next chapter details why open government data has been chosen as the context for this research, which research questions will be investigated, and on what methodologies the investigation is based.


3. Assessing current open government data on trustworthiness

3.1 Why open data

Open data are a good representation of current developments in the information society. Governments want to be transparent and publish their data as open data if this does not infringe privacy or copyright legislation. Innovative information architectures like the municipal Common Ground movement rely on (open) data repositories for business transactions, making use of technologies like linked data, services and application programming interfaces (APIs).

More and more open data is published to be reused by others than the creators. These open data can be considered records, because they are “intellectual entities, set aside for action or reference” and therefore should be trustworthy. Nevertheless, open data face challenges related to trust, specifically related to data quality and the difficulty of assessing data quality (Jaakkola et al., 2014). In the Interpares Trust programme a study specifically focussed on the “Implications of Open Government, Open Data, and Big Data on the Management of Digital Records in an Online Environment” (Suderman & Timms, 2017). The study finds trust issues in open data and open government initiatives. These trust issues are related to gaps in the “recordkeeping infrastructure” and a focus on “participatory or collaborative governance” to increase trust. The overall objective of the study was to identify records-related issues in order to support the establishment of appropriate InterPARES Trust research projects to address the issues.

The open data landscape is changing rapidly, and so is legislation. New Dutch legislation is in preparation (the “Wet Open Overheid”, related to the new European open data directive 2019/10243). The directive also contains general quality requirements related to the use of metadata, standardization and accessibility.

This research addresses trust issues of open data and their relation to recordkeeping principles. Its objective is to find the relevant recordkeeping infrastructure components needed in the Dutch data governance and open data landscape to support the trustworthiness of open data. The next paragraph poses the research questions and presents the research methodology.

3.2 Research questions and expected outcome

In current data governance environments there is a lack of attention for the capturing, processing, recording, reuse and publication of information in such a way that this information can be trusted and used in an authoritative way. Current information creation is so fast and so complex that it is difficult to assess whether information is authoritative and whether it is true or has been changed. Casellas et al. (2012) indicate that from a records management perspective data quality is one of the most relevant problems in (open) data projects. Lemieux et al. (2014) showed, in a study about the use of big data for visual analytics, that the lack of recordkeeping standards for linking data to the context of their creation for proper interpretation caused knowledge gaps. More and more amateur-based content creation, combined with direct archiving, raises the question whether we are still able to trust (sources of) information. Duranti (2018) states that we live in a post-truth era and that “objective facts are less influential in shaping public opinion than appeals to emotion and personal belief”.

3 https://eur-lex.europa.eu/legal-content/NL/TXT/PDF/?uri=CELEX:32019L1024&from=EN

Within the Interpares research program, the implications of open government, open data and big data on the management of digital records have been studied by Suderman & Timms (2017). They conclude that open government and open data initiatives and structures do not have accountability as their only guiding principle, but also citizen participation, technology and innovation, and transparency. They identify trust issues with respect to open government and open data initiatives, underpinned by gaps observed in recordkeeping infrastructure and operations.

The objective of this study is to find the relevant recordkeeping infrastructure components needed in the Dutch data governance and open data landscape to support the trustworthiness of open data.

The expected outcome of this research is that a difference in view (still) exists between trust from an archival perspective (more focussed on the trustworthiness of the provenance and the repository itself: social based trust) and trust from a data governance & -science perspective (more focussed on the trustworthiness of the content and the data quality: content based trust). It is also expected that the incorporation of recordkeeping principles (as part of authoritative information resource management) in data governance can lead to more trust in open data.

To reach the objective of this research, the following research questions have been formulated

and will be answered using the proposed research methodologies:


Table 3.1 Research questions and proposed research methodologies

Research question 1: What are the key components of trust of open data and what recordkeeping principles support them?
Proposed methodology: Literature study of both the archival science domain and the data governance & -science domain to extract the key components and supporting recordkeeping principles.

Research question 2: What are the differences in valorisation of the key components for trust from the archival perspective and the data governance & -science perspective?
Proposed methodology: Focus group with archival and data governance experts to rate the relevance of the key components found in research question 1 and, if applicable, to add new ones. Different experts who are currently working on the (municipal) information architecture for data and records, including the open data arena, will be approached to review the key components for trust.

Research question 3: How do current Dutch open data sources promote trust by incorporating recordkeeping principles that support trust?
Proposed methodology: Investigation of three open data sources:
• one based on legislation, in this case the official Dutch base registry for buildings and addresses (BAG)
• one not based on legislation but created by a government institution with specific directives for the registry, in this case official air quality data created by RIVM
• one based on a crowd sourced registry which is re-used by government, in this case air quality data acquired by volunteers using sensors on bikes
The investigation consists of determining whether the key components of trust are supported in the open data sources by the use of recordkeeping principles. The investigation will be based on information available for public use; to identify the relevance of the transparency and presence of all relevant information for public use, additional information may be solicited from the owner and/or publisher of the information when not available publicly.

Research question 4: What recommendations can be made for the Dutch data governance and open data landscape to promote trust of open data (environments)?
Proposed methodology: Analysis of the results of research questions 1, 2 and 3 and elaboration of recommendations, where applicable related to current developments in the field.


4. Recordkeeping principles supporting trustworthiness of open data

4.1 Key components of trust and supporting recordkeeping principles

The first research question has the objective of identifying the key components in authoritative information resource management that support the level of trust of open data. Literature from both the archival science domain and the data governance & -science domain has been studied. The concepts of both social based trust and content based trust have been taken into account.

From an archival perspective, Donaldson & Conway (2015) studied user conceptions of trustworthiness for digital archival documents. They conclude that the way to create and guarantee trust is to be able to preserve the identity and integrity of digital records. Donaldson & Conway did a qualitative study on information in the form of images to determine the components of trustworthiness and their relative importance. The following components have been identified and confirmed to be relevant for trust from the archival perspective, related to the Integrated Model of Trust in Information from Kelton et al. (2008):

• Accuracy: believed to be free of error

• Believability: the extent to which the information appears to be plausible

• Coverage: completeness of the information

• Currency: the degree to which the information is up-to-date

• Objectivity: balance of content

• Stability: the persistence of information, both its presence and contents

• Validity: the use of responsible and accepted practices such as the soundness of the

methods used, the inclusion of verifiable data, and the appropriate citation of sources

Donaldson & Conway also uncovered the following emergent themes that were not identified

within the Integrated Model of Trust in Information:

• Perceived authenticity: is it fake?
• Inaccurate information: conceptualizing documents as being trustworthy despite containing inaccurate information
• Primary or first-hand evidence: the extent to which the document is primary or first-hand
• Document’s legibility or readability
• Document’s perceived proper form (this relates to “coverage”, “stability”, “validity” and “readability”)

From a data governance & -science perspective, Wang et al. (2011), in their study on trust in machine-created data, conclude that trust is related to the reputation of the creator and/or publisher of data. The reputation of the creator and/or publisher is related to the supposed quality of the data created and/or published. The quality indication is based on assessment by experts, reputation rules of computational systems, or the re-users of the information. The reputation is based on transparency and consistency of the execution of business rules, which determine part of the quality of the data. Data quality assurance is therefore crucial and directly related to authoritative information resource management. Strong et al. (1997) and Mazon et al. (2012) recognize “fitness for use” as the key factor of data quality (based on the components topicality, completeness, correctness and precision) and as an important criterion for data analytics and business. The two key components identified for the trustworthiness of data are the reputation of the creator and/or publisher and the quality of the data / fitness for use.

From an open data perspective, the Sunlight Foundation defined and published the Open Data Principles (Sunlight Foundation, 2010). These principles are a combination of good practices and requirements to promote and support the (re)use of open data. These requirements are supposed to support trust and to lower the barriers for (re)using open data. The principles are:

1) Completeness, including release of descriptive metadata, with the highest possible level of granularity, which will not lead to personally identifiable information;
2) Primacy, collected at the source, including information on how and where data were collected to allow verification by users;
3) Timeliness, data should be released as quickly as possible;
4) Ease of physical and electronic access;
5) Machine-readable, in formats that allow machine-processing;
6) Non-discrimination, available to anyone with no requirement of identification or justification;
7) Use of commonly-owned or open formats;
8) Licensing, no imposition of attribution requirements and preferably labelled as part of the public domain;
9) Permanence, data should remain online with appropriate version-tracking and archiving;
10) Usage costs, data available preferably free of charge.

The key components of these open data principles that are related to the trustworthiness of the data itself are completeness, primacy and timeliness. Components 4 to 10 are more related to the sustainable accessibility of open data. For this research, accessibility is largely taken for granted because the topic of research is open (accessible) data. However, two aspects are taken into account that relate to the sustainability of the accessibility: machine-readable (5) and permanence (9).

In paragraph 2.4 the recordkeeping principles that support trust have been identified: descriptive metadata and repositories that guarantee integrity and preservation. In order to verify whether the identified recordkeeping principles support all the identified trust components, an integration matrix has been made. Starting from the trust components from the archival perspective, which are the most granular, the corresponding trust components identified in the data governance & -science domain and the open data principles are connected (placed in the same row) based on their definitions. Each row is then identified as a component that relates to social based trust and/or content based trust. Finally, the identified recordkeeping principles are related to each combination of trust components.


Table 4.1 Relation of trust components (3 perspectives), trust type and recordkeeping principles

Archival user perspective | Data science perspective | Open data perspective | Related to social based trust and/or content based trust | Supporting recordkeeping principle
Accuracy | Fitness for use4 | Completeness | Content based trust | Descriptive metadata5
Believability | Reputation of creator and/or publisher; Fitness for use | Primacy; Completeness | Social based trust; Content based trust | Descriptive metadata; Repository that guarantees integrity and preservation
Coverage | Fitness for use | Completeness | Content based trust | Descriptive metadata
Currency | Fitness for use | Timeliness; Permanence | Content based trust | Descriptive metadata
Objectivity | Reputation of creator and/or publisher | Primacy | Social based trust; Content based trust | Descriptive metadata
Stability | Reputation of creator and/or publisher | Permanence | Social based trust; Content based trust | Repository that guarantees integrity and preservation
Validity | Reputation of creator and/or publisher | Primacy | Social based trust; Content based trust | Descriptive metadata
Perceived authenticity6 | Reputation of creator and/or publisher | Primacy | Social based trust; Content based trust | Repository that guarantees integrity and preservation
Inaccurate information | Fitness for use | Primacy | Social based trust; Content based trust | Descriptive metadata
Primary or first-hand evidence | Reputation of creator and/or publisher | Primacy | Social based trust | Repository that guarantees integrity and preservation
Document legibility or readability | Fitness for use | Machine-readable; Commonly-owned or open formats | Social based trust; Content based trust | Repository that guarantees preservation
Document’s perceived proper form | Fitness for use | Primacy | Social based trust; Content based trust | Descriptive metadata

4 Fitness for use is the main component of data quality. It is the extent to which data are suitable for the purpose for which they have been created and are used; aspects related to data quality are topicality, completeness, correctness and precision; a good description also helps to assess whether the data will be suitable for reuse for another purpose

5 Descriptive metadata refers to all recordkeeping metadata that allow unique identification of the record, allow the verification of the provenance (including the juridical and administrative context, i.e. actors and the procedural context, i.e. processes, use of standards etc) and describe the context of the record that support the reliability and usability (including accuracy, coverage, currency)

6 In this research authenticity is defined as the combination of verifiable identity and verifiable unchanged original content


Summarizing the integration matrix, it can be concluded that:

• although there are differences in terminology and granularity, all identified trust components from the three perspectives are present, can be related to each other and can be supported by the use of recordkeeping principles
• most combinations of trust components are related to both content based and social based trust; this confirms the interrelation between both forms of trust: social based trust induces content based trust and the other way around
• from an open data perspective most components are not incorporated in the integration matrix because they are related to (sustainable) access (ease of physical and electronic access, non-discrimination, licensing, usage costs); the components related to access and legibility are incorporated (machine-readable and use of commonly-owned or open formats)

In the above approach it is assumed that information is available and accessible. From the perspective of this research this assumption is logical because open data are assessed. However, the availability and accessibility of information in general is not self-evident. Public bodies are encouraged, or in some cases obliged, to publish their information, with related minimal requirements on the use of (standard) metadata and open standards for accessibility (like linked data, standard APIs and standard services). Because of the importance of this aspect of accessibility related to legibility (machine-readable and the use of commonly-owned or open formats), this will be assessed in this research as well.

In this paragraph the key components in authoritative information resource management that support the level of trust of open data have been identified, including the recordkeeping principles that support them. This identification is based on literature and has resulted in the integration matrix elaborated above. It presents the corresponding trust components from three perspectives, their relation to social and content based trust, and the related recordkeeping principles that support the trust components. In order to verify and enrich this result, the integration matrix has been analysed by a small focus group of experts who are working on new architectures for sustainable access to digital information, including data. The results of this process are presented in the next paragraph.

4.2 Focus group about valorisation of key components of trust

Literature has shown a broad approach to the trustworthiness of information. The integration matrix is the result of a literature review with a focus on open government data and presents the key trust components from both the archival perspective and the data governance & -science perspective. In order to verify and enrich this result, several experts were invited to participate in a focus group. Only two experts were able to do so: Paul Groth of the University of Amsterdam and Erik Saaman of the National Archives of the Netherlands. Looking at the integration matrix with the identified key components for trust, the experts indicated that the matrix needs a better justification and/or explanation of the corresponding terminology between the archival, the open data, and the data governance & -science perspectives. The objective of this research, however, is to find the recordkeeping principles supporting the trust components and to assess current open data based on that; an in-depth comparison is therefore beyond the scope of the research. To address this feedback, the integration matrix has been converted into a relational diagram illustrating the relation of the recordkeeping principles to each of the trust components of each particular domain perspective (see figure 4.1).

The archival perspective focusses on information that has to be trustworthy over time, the data perspective focusses on information that has to be trustworthy in content, and the open data perspective focusses on the accessibility of information. Recordkeeping principles support the trustworthiness of information regardless of its form and use. The figure shows that all trust components are covered by the three recordkeeping principles.

Secondly, the difference between trustworthiness based on provenance and trustworthiness based on the information itself was discussed. As elaborated earlier, trust can be based on social structures. It reflects the trust in a social system with known identities of the actors. Provenance describes and confirms where information comes from and how it was created and/or obtained. This social trust is based on trust in the creator and/or provider of the information. It is often based on clear and verifiable procedures for the creation and maintenance of the data, which support the assessment of the quality of the data. If trust in the social system is absent or declining, alternatives arise, such as systems based on blockchain technology. These initiatives claim to support trust based on complete transparency and the consensus of a community to support the auditability of the provenance of information. Key aspects are verifiable identity and verifiable transparency.

Figure 4.1 Relations of the trust components of each perspective with the recordkeeping principles

Trust can also be based on (the quality of) the information itself. This trust can be obtained by assessing the content of the information based on its characteristics. This can be done independently of the social trust, but it can also be one of the backbones of the social trust (“this creator of data always produces good data”). To assess the content of the data, however, supporting structures with contextual information are needed to understand the data (Borgman, 2015). This process of sensemaking of data has been the object of study. Faniel et al. (2019) have created a typology of the contextual information needed to support data evaluation across three disciplinary domains, finding that information about data production, data repositories, and data usage is key in making decisions about reusing data.

It was concluded that it is important to distinguish between these two different perspectives of trust, but that the terminology used to distinguish between trustworthiness based on provenance and trustworthiness based on the information itself was not that clear. Therefore clearer terminology was suggested: social based trust and content based trust. This terminology has been incorporated retroactively in this research.

Thirdly, a focus was put on the metadata that are most relevant for trust in data. In “Talking datasets – understanding data sensemaking behaviours”, Koesten et al. (2020) elaborate on a study focussing on the user perception of datasets and identify necessary data attributes that support trust because they support sensemaking activities like inspecting data, engaging with content, and placing data within broader contexts. The research concludes with documentation practices which can be used to facilitate sensemaking and subsequent data reuse, and which therefore support trust and long-term preservation of meaning.

The recommended attributes for data are attributes that describe the following basic characteristics: format, provenance (including research field and methods used), purpose, topic, location, quality and uncertainty, and time. Nevertheless, the authors plead for a more expanded and robust way than current standardized metadata and reporting conventions to structure and support the sensemaking of data.
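As a simple illustration of how these recommended attributes could be operationalised, the sketch below checks which of them are present in a dataset description. The attribute keys and the example metadata are assumptions of this sketch, not a prescribed schema.

```python
# Attributes Koesten et al. (2020) found important for sensemaking (as listed above).
RECOMMENDED_ATTRIBUTES = {
    "format", "provenance", "purpose", "topic",
    "location", "quality_and_uncertainty", "time",
}

def coverage_report(metadata: dict) -> dict:
    """Report which recommended attributes are present in a dataset description."""
    present = {k for k, v in metadata.items() if v not in (None, "", [])}
    return {
        "covered": sorted(RECOMMENDED_ATTRIBUTES & present),
        "missing": sorted(RECOMMENDED_ATTRIBUTES - present),
    }

example = {"format": "CSV", "provenance": "RIVM measuring network",
           "topic": "PM10 concentrations", "time": "2018-2020"}
print(coverage_report(example))
# {'covered': ['format', 'provenance', 'time', 'topic'],
#  'missing': ['location', 'purpose', 'quality_and_uncertainty']}
```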

The above-mentioned attributes have been compared with the metadata standard that is being proposed by the National Archives of the Netherlands in the design for MDTO (Metagegevens voor Duurzaam Toegankelijke Overheidsinformatie7). This minimal set of descriptive metadata is being defined by the National Archives, in cooperation with the Dutch Association of Municipalities and other governmental bodies. It is supposed to cover the trust components, although its objective is to support the process of “overbrenging” (i.e. transfer) of digital information to archival repositories. Nevertheless it should be applicable as well to data that will be available as records from their source information system or in specific open data repositories.

The requirement of a minimal set of descriptive metadata will be used as one of the key recordkeeping principles against which the open data sets are assessed in this research. The assessment form that is elaborated in the next paragraph has been detailed with those metadata elements found important for trust that coincide with equal or similar metadata elements within the MDTO. An additional focus will be on candidates for expansion of the current standardized metadata, where applicable to the data to be researched, such as certain types of information structures attached or linked to the dataset, records and/or columns that provide context and facilitate particular sensemaking patterns over others.

7 https://www.nationaalarchief.nl/sites/default/files/field-file/OD08_TMLO.pdf

Finally, a reference and comparison was made with the FAIR principles8. The acronym stands for findable, accessible, interoperable and reusable data. The FAIR guiding principles promote optimizing data sets for reuse by both humans and machines. This implies sufficient human-readable and/or machine-readable metadata to describe the datasets and their components. Of these principles, the following two are most relevant to this research: “(Meta)data are associated with detailed provenance” and “(Meta)data meet domain-relevant community standards”.

While the FAIR principles refer to (meta)data, a related initiative has appeared, called TRUST for FAIR.9 The TRUST principles describe the requirements for data repositories for managing and disseminating data over a long period of time. The most relevant components for this research are the requirements related to reliable and secure operations (related to the integration matrix recordkeeping principle of a repository that guarantees the integrity of the information) and the support of long-term data and knowledge preservation (related to the integration matrix recordkeeping principle of a repository that guarantees preservation).

The results of the focus group have shown that there are no fundamental differences in valorisation of the key components for trust between the archival perspective and the data governance & -science perspective. Both perspectives recognize the importance of the key components. The feedback of the focus group has been incorporated in the research. The integration matrix was reviewed by the two experts, each from their own perspective (the archival perspective and the data governance & -science perspective). The investigation of this research consists of determining whether the key components of trust are supported in the open data sources by the use of recordkeeping principles. Therefore an assessment form has been elaborated to assess the open data sets on this. The form is based on the literature review, the integration matrix and the feedback of the experts. This assessment form is presented in the next paragraph.

8 https://en.wikipedia.org/wiki/FAIR_data

9 https://www.slideshare.net/daweilin/trust-principles-2019rda


4.3 Form to assess the open data

The integration matrix and the related relational diagram (figure 4.1) have been transformed into an assessment form (Table 4.2).

Table 4.2 Assessment form for assessing recordkeeping principles in open data sets

Descriptive metadata
• Minimal available descriptive metadata: can the juridical and administrative context (i.e. actors) and the procedural context (i.e. processes, use of standards) be derived? Is this descriptive metadata available at dataset level, record level or attribute level?
• Is the format of the data described?
• Provenance, general: can the identity of the creator be identified and verified?
• Provenance, specific: can the process / methods be identified that have been used to create or derive the data? Are the data primary (no selection, no processing (except processes of dissociation of personal data) and no copying (accessing the source))?
• Topic: what does the data represent; does the data have a defined taxonomy or ontology?
• Purpose: for what purpose has the data been created?
• Location: to what location is the data related?
• Time: to what time frame is the data related? How current are the data and metadata? Does historical data remain online with appropriate version-tracking and archiving? Is continuous availability guaranteed by a legal or policy commitment?
• Quality and uncertainty: is a description of correctness, completeness and precision/granularity present?
• What other non-domain specific metadata are available that support trust components?
• Are the metadata comprehensive; do the metadata comply with metadata standards (international standards, domain standards)?

Repository

Repository accessibility
• Are the data findable and accessible: how can the data be found (portal, search engine), how can the data be accessed (API, linked data/URI, (download) portal), and which commonly-owned or open formats (including machine-readable) are used?
• Is the data accessible from its source information system or a recognized repository for publishing?

Repository integrity
• Are technologies being used that guarantee the integrity of the source information system or repository, securing that the information has been “frozen” at the right moment and therefore cannot be changed anymore?

Repository preservation
• Does the information system or repository incorporate preservation technology to guarantee readability and usability in the future? Are the records (metadata and data) available in a sustainable format (5-star model of Tim Berners-Lee (http://5stardata.info/en/): open licence, not structured or open format / machine-readable structured data / non-proprietary format / open standards from W3C / linked open data)?


The form is the principal component of the analytic framework for the assessment of the open data sets. It makes it possible to assess how current Dutch open data sources support the key components of trust by incorporating recordkeeping principles.

The form consists of questions regarding the presence and/or incorporation of the identified recordkeeping principles. Each recordkeeping principle has been translated into relevant questions that indicate whether the recordkeeping principle is present or whether its objectives are met. These detailed questions are based on the literature study and the feedback from the experts of the focus group.
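To illustrate how the completed forms could be summarised per dataset, the following sketch records answers to a shortened selection of the questions of Table 4.2 and averages them per recordkeeping principle. The three-level scoring and the abbreviated question texts are assumptions of this sketch, not part of the assessment framework itself.

```python
# Abbreviated questions grouped per recordkeeping principle (see Table 4.2).
ASSESSMENT_FORM = {
    "descriptive_metadata": [
        "Can the juridical, administrative and procedural context be derived?",
        "Is the format of the data described?",
        "Can the identity of the creator be identified and verified?",
        "Are quality, completeness and precision described?",
    ],
    "repository_accessibility": [
        "Are the data findable and accessible via open formats, APIs or linked data?",
    ],
    "repository_integrity": [
        "Is the information 'frozen' so it can no longer be changed?",
    ],
    "repository_preservation": [
        "Is readability and usability guaranteed for future use?",
    ],
}
SCORES = {"present": 1.0, "partially present": 0.5, "absent": 0.0}

def summarise(answers: dict) -> dict:
    """Average the scores per recordkeeping principle for one dataset."""
    summary = {}
    for principle, questions in ASSESSMENT_FORM.items():
        values = [SCORES[answers.get(q, "absent")] for q in questions]
        summary[principle] = sum(values) / len(values)
    return summary

example_answers = {"Is the format of the data described?": "present"}
print(summarise(example_answers))
```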

As indicated earlier, the essence of open data is that they are available and accessible. However, it is not self-evident that this is done using (open) standards for accessibility (like linked data, standard APIs and standard services). Therefore these accessibility-related components are assessed as well within the assessment form.

The presented form is used to assess each of three open datasets within the Dutch open data

landscape on the level of incorporation of recordkeeping principles. The results of the

investigation are presented in the next chapter.


5. Incorporation of recordkeeping principles in Dutch open data

5.1 Introduction

The third research question considers how current Dutch open data sources promote trust by incorporating recordkeeping principles. To assess the level of incorporation of recordkeeping principles in the Dutch open data landscape, three different datasets have been analysed using the assessment form presented in the previous chapter.

The investigation has primarily been based on the information available for public use; to identify the relevance of the transparency and presence of all relevant information for public use, additional information has been solicited from the owner and/or publisher of the information when not available publicly.

The selection of the three datasets has been based on the supposed level of formalisation of the environment in which the data were created. This has been done based on an analogy with the continuum of land rights. Within the land administration domain the concept of a continuum of land rights exists10. The continuum of land rights is a concept or metaphor for understanding the diversity in tenure rights, varying from informal land rights such as customary rights to formal land rights such as registered freehold. An analogy can be made with the level of formalisation of data creation. Data can be created based on formal processes defined by legislation and executed by government bodies or contractors. Data can also be created in a much less formal way, for instance based on initiatives of citizens to provide governments with data they think might help decision making or policy making, or to challenge government decisions. These are two extremes of a continuum of data creation environments which vary in formalisation; see the figure below.

Figure 5.1 Formalisation continuum of data creation environments

The selection of datasets based on supposed differences in the level of formalisation allows for a comparison of the support of trust components, through the incorporation of recordkeeping principles, between data created in environments with supposedly different levels of formalisation.

The first dataset is supposed to be well formalised. It is an official governmental dataset, regulated by law: the official Dutch base registry for buildings and addresses (BAG).

10 https://unhabitat.org/secure-land-rights-for-all


The second dataset is supposed to be somewhere in the middle of the formalisation continuum. It is a dataset created by several governmental bodies and comprises air quality measurements. This registration is not regulated by legislation but has certain formalisation components, like strict domain requirements on how air quality should be measured.

The third dataset is supposed to be an informal dataset. It also comprises air quality measurements, but has been initiated through crowd initiatives and is supported by official institutions like RIVM. The information is published by a private platform, the City Innovation Platform, which is used by government bodies as well.

The data itself, the corresponding metadata and the references to other relevant information

related to the data have been assessed based on the information available as open data on the

internet. For each dataset the following aspects are described:

• how and where the data and additional information have been accessed for this research
• a description of the objective and content of the data: what are the source data, what part of the source data is published, a visualization of these data, who created the data, and for what objective the data have been created
• the assessment of the incorporation and/or presence of the recordkeeping principles identified in the previous chapter, based on the form elaborated in that chapter
• a reflection on the use of the data and the relevance of recordkeeping principles to support trust, as far as this can be identified

Because not every aspect could be assessed based on the publicly available information, a

request has been sent to the three different providers of the open data to answer the following

questions:

1. Has the metadata that is available for the dataset been provided for by the owner of

the data or is this metadata being generated at the moment that the data is being

incorporated in the publication platform?

2. Is the metadata valid for the dataset as a whole or also for every individual

information object?

3. Is additional information available about the processing (selection, conversion,

calculations etc) that has been done in transferring data from the source to the

publication platform?

4. Is history of information (and related processing) available?

5. Is the information secured in a way that it cannot be changed any more, not by users

accessing the information from the open data platform or services nor by the

providers of the open data (i.e. has the information been "frozen")?

6. How do you guarantee that the information will be legible and usable in the future: is

this based on for instance conversions of the data or is this based on other methods /

techniques and/or agreements?


The answers to these questions by the providers of the information help to identify the

relevance of transparency about the way data has been created, processed and published. The

answers have been incorporated in italics, both in the paragraphs that describe each dataset

and in the assessment form. The next three paragraphs describe the aspects of each of the three datasets that are part of this investigation. Subsequently the three datasets are compared and analysed with respect to the presence and/or incorporation of trust components. Finally, the usefulness of the assessment form itself is assessed as well.

5.2 The Dutch base registry of buildings and addresses (BAG)

How the data has been accessed for this research and based on what information

To start the investigation, first of all the platform data.overheid.nl has been accessed. This

platform is a registry of available open data that is generated by Dutch governmental

institutions. Registration of open datasets is not mandatory, so the registry does not contain all available open government data.

The platform provides the basic information about the base registry BAG. On the platform a

general description is given of the data. References are made to other websites for more in-

depth information. The creation, maintenance and use of this information is defined by law11.

The minimum quality requirements are described and are audited on a yearly basis.12

There is a complete catalogue description of the BAG data available13. This catalogue

describes the requirements for data quality and gives details about the information objects, their attributes and their relations.

The actual data that are available as open data can be accessed in several ways:

• visualization with the specific app bagviewer14

• download from the website of the Dutch Cadastre15: this is a paid service for private

companies and private persons. Two versions can be downloaded: a version that

contains the data that were current on a given date, or a version that contains all data

including historical data of the life cycle of the information objects

• use of specific services like WFS16

• use of REST APIs17, including a REST API for the use of linked BAG data
• using SPARQL18

11 https://www.geobasisregistraties.nl/basisregistraties/adressen-en-gebouwen/bag-wet-en-regelgeving

12 https://www.geobasisregistraties.nl/basisregistraties/documenten/publicatie/2019/06/21/kwaliteitskader-bag-2019

13 https://www.geobasisregistraties.nl/binaries/basisregistraties-ienm/documenten/publicatie/2018/03/12/catalogus-2018/Catalogus-BAG-2018.pdf

14 https://bagviewer.kadaster.nl/lvbag/bag-viewer/index.html# geometry.x=160000&geometry.y=455000&zoomlevel=0

15 https://zakelijk.kadaster.nl/-/bag-extract

16 https://www.nationaalgeoregister.nl/geonetwork/srv/dut/catalog.search#/metadata/1c0dcc64-91aa-4d44-a9e3-54355556f5e7

17 https://data.pdok.nl/bag/api/v1/

18 https://data.pdok.nl/sparql#
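As an illustration of how these standardized services can be consumed programmatically, the sketch below requests a handful of BAG features over WFS with plain HTTP. It is a minimal sketch only: the endpoint URL is a placeholder and the layer name bag:pand and the attribute names bouwjaar and status are assumptions for illustration; the authoritative service address, layer names and attributes should be taken from the service metadata in the Nationaal Georegister.

```python
# Minimal sketch of querying the BAG WFS as open data (Python, requests).
# The endpoint and layer name below are placeholders; consult the WFS service
# metadata in the Nationaal Georegister for the authoritative values.
import requests

WFS_ENDPOINT = "https://example-pdok-endpoint/bag/wfs"  # placeholder endpoint
params = {
    "service": "WFS",            # standard OGC WFS 2.0 parameters
    "version": "2.0.0",
    "request": "GetFeature",
    "typeNames": "bag:pand",     # assumed layer name for buildings
    "count": 10,                 # limit the number of returned features
    "outputFormat": "application/json",
}

response = requests.get(WFS_ENDPOINT, params=params, timeout=30)
response.raise_for_status()
features = response.json().get("features", [])

for feature in features:
    props = feature.get("properties", {})
    # 'bouwjaar' (year of construction) and 'status' are typical BAG attributes
    print(feature.get("id"), props.get("bouwjaar"), props.get("status"))
```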


Description of the objective and content of the data

The Dutch base registry of buildings and addresses is part of a framework of 13 base

registries in the Netherlands19. These base registries form the core of the Dutch governmental

information landscape and are fundamental for both governmental bodies and other parties

that use basic information about buildings and addresses. Within Dutch government, the use

of the information of this registration is obligatory.

The BAG registry incorporates both actual information about addresses and buildings as well

as historical information (addresses that do not exist anymore, buildings of which the

geographical description has been changed over time and of which the old descriptions are

still available).

Dutch municipalities are responsible for maintaining this information. The information can be published as open data by every municipality itself (for instance, the Municipality of Haarlem publishes this information on its open data platform20). However, the official distribution and publication of this information is done by the “Dienst voor het kadaster en de openbare registers”, the Netherlands’ Cadastre, Land Registry and Mapping Agency. It receives a daily copy of all changes made by the Dutch municipalities. These changes are processed into a national information system (called LVBAG, Landelijke Voorziening BAG), which is used for the “private” distribution of this complete information for reuse by institutional users like government bodies.

For the publication of this information as open data a subset is being selected and published

on the platform PDOK (“Publieke Dienstverlening Op de Kaart”). Based on the questions sent

to Kadaster for this research the following was indicated:

the national information system for BAG (LVBAG) is responsible for the metadata of

the dataset as a whole. From the LVBAG a set of mutations is generated on a daily

basis which is being processed within the publication platform PDOK. This is an

automated process. History is not being maintained in the PDOK platform but only in

the LVBAG. Only the municipalities can make changes within their source data based

on the established mechanisms. Applied changes result in new “versions” of objects

maintaining the history (earlier “versions” of objects). Changes are made based on

authorisation of the users and certificates to infringe information security protocols.

Changes in the platforms (LVBAG and /or PDOK) are being discussed, prepared and

executed together with all stakeholders to guarantee sustainable accessibility and

usability of the data.
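To make this versioning mechanism concrete, the sketch below illustrates how a change produces a new object version while the old one is preserved. It is a simplified assumption for illustration only, not the LVBAG implementation; the attribute names mirror the validity dates (BeginGeldigheid/EindGeldigheid) of the BAG data model and the object identifier used is hypothetical.

```python
# Illustrative sketch of the "new version on every change" mechanism described
# above; the class and attribute names mirror the BAG validity dates, but the
# code is a simplified assumption, not the actual LVBAG implementation.
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

@dataclass
class PandVersion:
    identificatie: str                 # object identifier (stable across versions)
    status: str
    begin_geldigheid: date             # start date of validity
    eind_geldigheid: Optional[date]    # None while the version is current

def apply_change(history: List[PandVersion], new_status: str, change_date: date) -> None:
    """Close the current version and append a new one; nothing is overwritten."""
    current = history[-1]
    current.eind_geldigheid = change_date          # "archive" the old version
    history.append(PandVersion(current.identificatie, new_status, change_date, None))

# Hypothetical building object with one change applied to it.
history = [PandVersion("0363100012345678", "Pand in gebruik", date(2015, 3, 1), None)]
apply_change(history, "Pand gesloopt", date(2020, 6, 1))
print([(v.status, v.begin_geldigheid, v.eind_geldigheid) for v in history])
```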

The following image gives an impression of the data available via the PDOK platform. This

data has been visualized using the application BAG viewer provided by the Dutch Cadastre.

19 https://www.digitaleoverheid.nl/overzicht-van-alle-onderwerpen/basisregistraties-en-

stelselafspraken/inhoud-basisregistraties/

20 https://www.haarlem.nl/opendata/


Figure 5.2 Visualisation of the building and address information using the BAG viewer

Assessment of the presence of recordkeeping principles

Based on the publicly available documentation as mentioned in the description of the dataset

and an analysis of the available open data itself, the presence and/or incorporation of

recordkeeping principles has been assessed using the assessment form elaborated for this

research.


Table 5.1 Assessment form for BAG data

Minimal available descriptive metadata: can the juridical and administrative context (i.e. actors) and the procedural context (i.e. processes, use of standards) be derived?

Is the format of the data described? The format of the data is described partially: the format of a download is not available online; the format of the web service that delivers the data, the WFS (Web Feature Service)21, is described.

Provenance General: can the identity of the creator

be identified and verified?

The identity of the formal creator is defined as the

attribute “Bronhouder” of each object. It can also be

derived from the geographical position of the object

within a municipal jurisdiction combined with the

legal responsibility that every municipality has to

create and maintain this data.

Provenance Specific: can the process / methods be

identified that have been used to create or derive the

data?

The process / methods can partially be identified:

every municipality has its own specific process for

creating and updating of BAG objects, but the minimal

national requirements create a common base that

defines the fundamental subprocesses required for

the process (some of the information of the

subprocesses is incorporated as source documents

and comprise the results of the minimally required procedures/subprocesses, like a building permit,

an address allocation or a surveying process)

Are the data primary (no selection and no processing

(except processes of dissociation of personal data)

and no copying (accessing the source))?

The data are not primary: the data are a copied subset

of the data which are delivered by each municipality

to the LVBAG; the LVBAG has additional data and each

municipality can have more additional data available

in its own source data (for instance the municipality of

Haarlem22).

Topic: what does the data represent; do the data have

a defined taxonomy or ontology? The data do have a defined taxonomy: a data model description is available covering all available classes and each of their characteristics, including the significance of each component23.

21 https://www.nationaalgeoregister.nl/geonetwork/srv/dut/catalog.search#/metadata/1c0dcc64-91aa-

4d44-a9e3-54355556f5e7

22 opendata.haarlem.nl

23 https://bag.basisregistraties.overheid.nl/datamodel


Purpose: for what purpose has it been created? The purpose of this data can be found in the

legislation related to the base registry. The purpose in

this case is to serve as unique base information for

several kinds of processes that depend on address

and/or building information (for instance supporting

the processes of registration of change of residential

address of citizens, the issuing of building permits or the taxation of real estate).

Location: to what location is the data related? The BAG data are geodata. This means that location is

incorporated in each information object. The relevant

classes within the data set (building, address, public

space) have a geographic description incorporated as

an attribute / characteristic of each individual

instance of the class.

Time: To what time frame is the data related? What’s

the actuality of the data and metadata; does historical

data remain online with appropriate version-tracking

and archiving; is continuous availability guaranteed by

a legal or policy commitment?

The open data source only contains data that is

current. Data that is historical is available in the

LVBAG and in the municipal source registries. The

relevant classes (entity and graph) have an indication

of BeginGeldigheid (start-date of validity) and

EindGeldigheid (end-date of validity). Version tracking

is incorporated in the mechanism of the use of validity

dates (each change to an information object causes the old object to be archived and a new object to be created). Because all these data have to be available over time, the Dutch Cadastre guarantees the “archiving“ of the BAG data by continuous conversion of the LVBAG information system without loss of information.

Quality and uncertainty: is a description of

correctness, completeness, and precision/granularity

present?

There is no direct description of the correctness,

completeness or precision/granularity within the

metadata. The data is supposed to comply with the

regulation regarding correctness, completeness and

precision that are defined by law and which is being

audited once a year within each individual

municipality based on the official quality

requirements. This regulation in combination with

auditing reports24 can be considered as metadata

related to quality and uncertainty.

What other non-domain specific metadata are

available that support trust components?

There are no other specific metadata identified that

support trust components

24 The auditing reports are not published as open data but can be consulted at the Ministry of Interior


Are the metadata comprehensive; do the metadata

comply with metadata standards (international standards, domain standards)?

The metadata to describe the dataset depend on the

way you access the data; in this research the WFS has

been used; the metadata description of the BAG

service25 is compliant with the European Inspire

standards/ ISO 19119 and this is validated based on

the ISO standard 1913926; the metadata within the

data itself are domain specific for this base registry

and can be derived from the BAG data model

Repository: accessibility, integrity and preservation

Are the data findable and accessible: how can the

data be found (portal, search engine), how can the

data be accessed (API, linked data/URI, (download)

portal), which commonly owned or open formats

(including machine readable) are used

The data can be found using data.overheid.nl,

georegister.nl and pdok.nl (as well as in the open data portals of individual municipalities, which may incorporate richer information than the data published at national level). The data can be accessed using standardized, machine-readable services and APIs, and can be downloaded as well in open formats like XML.

Is the data accessible from its source information

system or a recognized repository for publishing

The PDOK environment is a recognized repository for

publishing. It contains copies of data that reside in

their own, closed source information systems. The

copies are a subset of the data in the source

information systems.

Are technologies being used that guarantee the

integrity of the source information system or

repository: securing that the information at the right

moment has been “frozen” and therefore cannot be

changed anymore?

Changes to the repository are executed by automated

mechanisms that process the changes identified in the

centralized source information system LVBAG.

Changes in the LVBAG can only be done by

municipalities, using their specific software and

protocols that update the LVBAG repository based on

the changes of the information in their local

information system. Changes to information always

lead to the archiving of the old object and the

creation of a new object (in the municipal information

system and the LVBAG); in this way changes to the information are being tracked and stored, but these are not available in the open data environment.

25 https://www.nationaalgeoregister.nl/geonetwork/srv/dut/catalog.search#/metadata/1c0dcc64-91aa-

4d44-a9e3-54355556f5e7?tab=inspire

26 ISO 19139 provides the XML implementation schema for ISO 19115 specifying the metadata record format

and may be used to describe, validate, and exchange geospatial metadata prepared in XML


Does the information system or repository

incorporate technology of preservation to guarantee

readability and usability in the future? Are the records

(metadata and data) available in a sustainable format

(5-star model of Tim Berners-Lee

(http://5stardata.info/en/) : Open licence, not

structured or open format/ machine-readable

structured data / non-proprietary format/ open

standards from W3C / linked open data)

The repository guarantees readability and usability by

using conversion of the data when changes are made

to the technical platform guaranteeing that no data is

lost.

Data remains readable and legible because of the use

of services that comply with international standards like

WFS and recently the incorporation of Linked Data.

Because of the use of WFS and linked data the data is

machine readable.

The license is open: CC-0 (1.0)

A reflection on the use of the data and the relevance of recordkeeping principles for trust.

The first reflection is related to the principal question of whether the open data has a sufficient level of trustworthiness to serve in legal contexts. This is the case for the BAG data.

Because of its legal foundation and an ample availability of metadata this information is used

as such (although for instance the actual creation and processing of the information of the

source is not verifiable because the actual processes within the municipalities are not

described).

The (re)use of BAG data is obligatory for Dutch government bodies. The use of the official

distribution mechanism LVBAG and of PDOK information requests is being monitored and evaluated. Figure 5.3 shows an infographic that presents the

evaluation of the use of BAG information in 2018 by the different types of users of BAG

information. Of the respondents 61% indicate that their own products become more reliable

with the (re)use of BAG information27. This is an indication of the trust users have in this

data.

The report and infographics show details about content based trust, reporting how users rate

the quality of the provided data. There are mechanisms in place to report supposed incorrect

content of the BAG data. These reports are being investigated and information is being

corrected if applicable. This mechanism helps to support the trust in the content of the (open)

BAG data. Special platforms exist for users to share information about the use of the BAG.28

These platforms serve as input for future changes to the data structure, information systems

and information services. This participation of stakeholders supports the trustworthiness of

the BAG ecosystem as well.

27 https://www.geobasisregistraties.nl/binaries/basisregistraties-ienm/documenten/rapport/2018/08/06/rapportage-bag-tevredenheidsonderzoek-afnemers-2018/Rapportage+Afnemersonderzoek+BAG+2018+2.0.pdf

28 https://imbag.github.io/praktijkhandleiding/


Figure 5.3 Infographic on the re-use of BAG data

5.3 Air quality measurements by Dutch institutions

How the data has been accessed for this research and based on what information

The basic information about the registry of air quality data can be found and accessed using

the platform www.luchtmeetnet.nl. On the platform a general description is given of the open

information that can be accessed and about the different data-providers of air quality

measurements (the Dutch institute for public health (RIVM), the ministry of infrastructure and

water-management (I&W) and 5 regional institutions that measure air quality as well).

References are made to the individual websites of each institution for more in-depth

information. The creation, maintenance and use of this information itself in not defined by

Dutch legislation. Nevertheless the Netherlands has to comply with European requirements

for air quality and agreements on lowering contamination. To monitor this compliance and

verify that actions taken have sufficient impact on air quality this network of air quality

measurement stations has been established. These measurements have to comply with

European regulations, the European air quality directive 2008/50/EG29. The regulations

indicate how many stations for measurements are required (related to population density and

level of air quality) and how the quality should be measured.

29 https://eur-lex.europa.eu/legal-content/NL/TXT/PDF/?uri=CELEX:32008L0050&from=en


The actual data that are available as open data can be accessed in several ways:

• visualization of current measurements, using the web platform30

• creation and downloads of reports of (historical) data, using the same web platform31

• use of specific services, including mapping32

• use of a specific API33
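To illustrate how the measurement data listed above can be retrieved, the sketch below calls the open API with plain HTTP. It is a minimal sketch under assumptions: the endpoint paths /stations and /measurements, the parameter names and the field names in the response are taken as assumptions based on the public API documentation referenced above, and the station number used is purely illustrative.

```python
# Minimal sketch of retrieving (non-validated) hourly averages from the
# luchtmeetnet open API. Endpoint paths, parameter names and response fields
# are assumptions based on the public API documentation and may differ.
import requests

BASE_URL = "https://api.luchtmeetnet.nl/open_api"

# 1. List measuring stations (the API is assumed to return paginated JSON).
stations = requests.get(f"{BASE_URL}/stations", timeout=30).json()
print(stations.get("pagination"), len(stations.get("data", [])))

# 2. Request hourly averages for one component at one station
#    (station number and component formula are illustrative values).
params = {"station_number": "NL01485", "formula": "PM10"}
measurements = requests.get(f"{BASE_URL}/measurements", params=params, timeout=30).json()

for m in measurements.get("data", []):
    # each record is expected to carry a timestamp, the component and its value
    print(m.get("timestamp_measured"), m.get("formula"), m.get("value"))
```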

Description of the objective and content of the data

Within the European Union, countries are obliged to maintain their air quality within certain

limits. To support this compliance with air quality regulation, different networks of stations

exist that measure air quality. The information about air quality is generated by sensors

located within the Netherlands which continuously measure the levels of various air

components like O2, CO2, carbon, etc. The network consists of 95 measuring stations (44 are

national stations managed by the RIVM, 51 are regional stations). The registration itself is not mandatory; however, determining air quality is, as is compliance with the prescribed way of measuring air quality.

The objective of this information is:

• to advise the general public when air quality is getting worse (a short-term goal)
• long-term analysis of air quality, to see whether policies work and which relations exist with public health and other relevant aspects related to air quality
• to comply with European regulations by reporting on compliance with air quality norms

In 2014 the Dutch institute for public health and environment (RIVM) initiated a platform34 to publish air quality data for the Netherlands. The purpose of publishing this information is on the one hand to inform the general public about air quality over time and on the other hand to give experts access to (re)use the information for data analysis.

The measurements are published at luchtmeetnet.nl. There are three different types of data

available on the platform:

• measured values of air components of the current year that are not validated (status35 of this information is not definitive)
• measured values of air components of the current year that are validated (status of this information is not definitive)

• measured values of air components of past years that are validated (status of this

information is definitive)

30 https://www.luchtmeetnet.nl/meetpunten

31 https://www.luchtmeetnet.nl/rapportages

32 https://geodata.rivm.nl/geoserver/wms?

33 https://api.luchtmeetnet.nl

34 https://www.luchtmeetnet.nl

35 This indication of status definitive or not definitive is not clearly indicated on the web platform but has

been provided by RIVM after consulting them directly


A clear statement is made about the validity of this information. The measurements published

at luchtmeetnet.nl are done with official reference devices. These devices are scientifically

viable and are calibrated. Nowadays a lot of cheaper sensors are available and are being used

to measure air quality (see the example in the next section of this research and the reference

website for crowd sourced measurement of air quality36). The results are not comparable.

The current measurements (these are measurements that are executed and are registered in the

current year) are measured with reference devices but are not validated yet. They are

published instantly and are available using the map interface on luchtmeetnet.nl, or as non-validated averages per hour via the API api.luchtmeetnet.nl37; access via the API is limited to averages per hour. Raw data (each

individual measurement) is not available via luchtmeetnet.nl.

Data is also available as “reports”. It is stated that these data are validated reference

measurements, which means that measured values are post processed to get the information

that is being published. Therefore this information can be used for quantitative analyses of

concentrations of air components.

After contacting RIVM, they provided some additional information about the metadata

registration and the post-processing of measurements. The platform does not provide metadata about the sensors used; however, they are registered38. The metadata on how the post-processing of measurements takes place is limited to a general description of the four principal

steps:

1. a direct control by a computer which verifies whether the sensor has been functioning

correctly (this is related to temperature and pressure requirements); all measurements

published on the website have passed this direct control

2. a monthly control by humans; detailed values of stations are compared with values of

other stations and values of different components of the same station are compared;

humans can interpret “strange” combinations that might be indications of measurements that are correct in themselves but taken under incorrect conditions, which might lead to rejection of the measurements

3. a control of the monthly averages to get insight at a more generic level of possible

(minor) systematic changes; these can be noticed because random deviations are cancelled out by averaging.

4. A yearly control: a comparison of the yearly averages of the components with the

averages of earlier years to find irregularities (i.e. bigger changes than expected);

control of the standards of calibration for the components in order to define

correction parameters for the sensors or the software
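The exact algorithms behind these controls are not published. Purely to make the idea of the first, automated control tangible, the sketch below shows what such a direct plausibility check might look like; the thresholds and field names are assumptions and do not represent RIVM's actual procedure.

```python
# Illustrative sketch (not RIVM's actual algorithm) of a first automated
# control: reject a raw measurement when the station conditions recorded
# alongside it fall outside assumed operating ranges.
from dataclasses import dataclass

@dataclass
class RawMeasurement:
    value: float            # measured concentration, e.g. in µg/m³
    temperature_c: float    # temperature inside the station
    pressure_hpa: float     # barometric pressure

def passes_direct_control(m: RawMeasurement) -> bool:
    """Return True when the measurement meets the assumed operating conditions."""
    temperature_ok = 15.0 <= m.temperature_c <= 35.0   # assumed range
    pressure_ok = 950.0 <= m.pressure_hpa <= 1050.0    # assumed range
    value_ok = m.value >= 0.0                          # negative concentrations are implausible
    return temperature_ok and pressure_ok and value_ok

raw = [RawMeasurement(12.3, 21.0, 1013.0), RawMeasurement(-4.0, 21.5, 1012.0)]
publishable = [m for m in raw if passes_direct_control(m)]
print(f"{len(publishable)} of {len(raw)} measurements pass the direct control")
```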

36 www.samenmetenaanluchtkwaliteit.nl.

37 https://api.luchtmeetnet.nl/open_api.

38 When consulted, RIVM indicated that the national network for measuring air quality includes an extensive registry system for registering additional (meta)data that contribute to verifying, validating and publishing the

measurement values but this information is not accessible via luchtmeetnet.nl


For this research RIVM was consulted additionally whether more information is available

about the post-processing and about the metadata that is being registered. The following

description was given by RIVM:

“considering the network for measuring air quality, which is the responsibility of RIVM, the

following metadata is being acquired, stored and related to the individual measurement

values:

• measurement station: there are regular visits to inspect the state of a measurement station and its direct environment. If there are relevant changes that might influence

the measurements and that therefore have to be taken into account with the post

processing, this is registered, reported and stored in the information systems to be

used in the validation process of measurements (changes that might affect

measurements are for instance changes in vegetation, building activities near stations,

road closures).

• Sensors within the measurement stations: the national network for measuring air

quality (Landelijk Meetnet Luchtkwaliteit, LML) measures air quality in conformity with, or equivalent to, the methodologies which are established within European directives39;

all sensors have unique identification numbers; maintenance, repair, control, testing

and calibration40 of sensors is being registered within the information systems to be

taken into account for post-processing.

• Measured values: every measured value is related to a sensor and to the corresponding station. Every measured value has a starting time and an end time.

These raw values (every minute, every 15 minutes) are aggregated within the station

to an average value per hour. However, the original values are being stored (since

2012). This allows for post processing based on the original values. Besides the actual

value of an air component, additional metadata is being measured and stored related

to for instance the quality of the measurement itself, the temperature within the station

and other relevant parameters. This information is used within the validation process

of the individual measurements and can lead to decline or to mark the measurements.

This is also used as a filter to not publish values on luchtmeetnet.nl that are definitely wrong because of, for instance, a sensor error. The measurement of some components requires calibration afterwards (for instance fine dust). This calibration

can lead to recalculation of values. Original data stays unchanged.

• Validation of measured values: validation is a complex automated and human-reviewed process in which already marked values (because of erroneous sensors, for instance) and newly marked values (because of extreme values, for instance) are used to

compare with surrounding stations, additional information about the measurement

39 2008/50/EC and 2004/107/EC
40 Sensors are calibrated regularly. This is largely an automated process. Information about the calibration is stored within the information systems. If a calibration leads to doubts about the measured values, these values will be marked and/or declined. Calibration failure can often be diagnosed and repaired.


station, the environment of the station, information about the sensors and the expert

judgement; values can then be approved or declined; the reason for approval or decline is stored and related to the corresponding value(s). The validated status is

being registered and communicated on the platform

• Changing status from not definitive to definitive: validated data gets the status definitive after the 1st of January of the following year (until then the data has the status

preliminary); reason is the application of calibration a posteriori (a comparison of

used methodologies for measurement with a reference methodology); this can only be

done after all measurements of one year have been received and validated.

Corrections are shared with the public after the status has been changed to definitive

• Corrections of definitive data; because of new insights sometimes it is necessary to

recalculate definitive data. Because this involves changes to official data (which has been communicated to the EU as part of the compliance with air quality regulations), this is referred to as an infringement procedure. It is registered when,

why and how the corrections have been executed. Calculated data will be

recalculated; normally original values are not changed (technically these data are

visible by means of views with the new parameters), however in extreme cases this

might be needed. If this is the case, the original value is archived and a second version

of the value is created. Corrections of definitive data are communicated as well via

the platform”

This detailed additional information by RIVM indicates that a large amount of metadata is

being acquired and registered to support the quality of the measurements and therefore the

trustworthiness of the information. It is clear as well that this complex mechanism of post-

processing of data that is or might be corrected implies that reports concerning the same

information generated today may be different from the reports that were generated a

year ago.

The historical data comprises the period from 2009 until 2020. Data older than that does not

seem to be available as open data. Using the API provided to select data from the publishing platform does not indicate the availability of information covering a broader time range.

The next figures show a visualization of the actual measurements and reports with validated

data.


Figure 5.4 Visualization of actual measurements of fine dust (PM10) on June 6th 2020, 14:55

hours

Figure 5.5 Report of historical measurements of fine dust (PM10) on August 6th 2018

Assessment of the presence of recordkeeping principles

Based on the publicly available documentation as mentioned in the description of the dataset

and an analysis of the available open data itself, the presence and/or incorporation of

recordkeeping principles has been assessed using the assessment form elaborated for this

research.


Table 5.2 Assessment form for air quality measurements by Dutch institutions

Minimal available descriptive metadata: can the juridical and administrative context (i.e. actors) and the procedural context (i.e. processes, use of standards) be derived?

Is the format of the data described? The format of the data can be selected/defined when a downloadable report is requested.

There is also a web mapping service available41 but

this returns information formatted in images. The

format of the data within the publishing platform can

be partly derived from the API definition42 (only for the data that is available by means of the API).

Provenance General: can the identity of the creator

be identified and verified?

The identity of the creator can be identified based on

the unique identification of each sensor. Additionally

the organisation that is responsible for that station is

available. Nevertheless it cannot be validated that the

information presented as being generated by the

sensor indeed has been generated by that sensor.

This is being stored in the source information system

of the organisations that provide the data to the

publishing platform.

Provenance Specific: can the process / methods be

identified that have been used to create or derive the

data?

The process / methods cannot be identified based on the publicly available information on luchtmeetnet.nl.

However, internally a description of the

characteristics of the sensor (which are available), the

calibration of the sensor (which does not seem

available as open data) and mechanisms for pre and

post processing that lead to the published values exist

(these are only available internally, but have been shared for this research by RIVM in a more general way).

Are the data primary (no selection and no processing

(except processes of dissociation of personal data)

and no copying (accessing the source))?

Whether the data can be considered primary or not

depends on your point of view: the raw data are not

published but averages per hour in the current year

(not validated and validated) are, and from the 1st of

January also definitive data of all previous years (in

the form of reports). The validated data can be

considered as primary because of the authority of the

organisations which are measuring and publishing

them.

41 https://geodata.rivm.nl/geoserver/wms?

42 https://api-docs.luchtmeetnet.nl/?version=latest#release-notes


Topic: what does the data represent; do the data have

a defined taxonomy or ontology ?

Yes, the data do have a defined taxonomy. At a

certain point in space and time, several physical and

chemical measurements take place. Location, time,

and raw values are registered and post-processed to

reflect values that are the basis to indicate air quality.

Purpose: for what purpose has it been created? The purpose is to have a scientific basis consisting of

measurements of the presence of air components

that are an indication of the air quality. From that, three types of use are derived.

Location: to what location is the data related? All individual measurements are related to a specific

location of the measuring station.

Time: To what time frame is the data related? What’s

the actuality of the data and metadata; does historical

data remain online with appropriate version-tracking

and archiving; is continuous availability guaranteed by

a legal or policy commitment?

All individual measurements are related to one

specific moment in time (with a start time and end

time of each individual measurement); historical

published data remain online; consulting RIVM reveals

that in the source information system version tracking

and archiving is done related to pre- and post-

processing. Continuous availability is not guaranteed.

Quality and uncertainty: is a description of

correctness, completeness, and precision/granularity

present?

There is no explicit description of correctness or

completeness of the information; measured values

can be changed based on a validation process (pre-

and post-processing procedures) which is described in

a general way but exact mechanisms and algorithms

are not provided. The precision/granularity can be

derived from the different domains that have been

defined for every attribute that is registered which is

also related to the European directive on air quality

measurements. Transparency on the validation

mechanisms can support an assessment on quality

and uncertainty.

What other non-domain specific metadata are

available that support trust components?

There are no other non-domain specific metadata

identified that support trust components; domain

specific metadata are described in the data model.

The procedures for measuring and post processing are

not provided in detail.

Are the metadata comprehensive; do the metadata

comply with metadata standards (international standards, domain standards)?

There is a general description of the dataset on the

publishing platform. The metadata within the data

itself are domain specific.

The vast majority of the metadata is automatically

generated and therefore can be considered as

metadated by design.


Repository: accessibility, integrity and preservation

Are the data findable and accessible: how can the

data be found (portal, search engine), how can the

data be accessed (API, linked data/URI, (download)

portal), which commonly owned or open formats

(including machine readable) are used

The data can be found using the general platform 43 or

a specific API44. Parts of the data can be found as well

using the facilities provided by the individual

organisations that are responsible for the

measurements45. The data can be accessed using the website form to select and download data (based on the CSV format).

It is explicitly stated that availability of the data is not

guaranteed: datasets can be changed, datasets can be

removed, or access to the platform can be removed

without notification. The API can be used based on a Fair Use Policy, but continuous use is not guaranteed; it can be blocked or removed.

Is the data accessible from its source information

system or a recognized repository for publishing

Based on the luchtmeetnet platform it is not clear

whether the data is accessed from its source

information system or from the publication platform.

The publication platform is an environment provided

by recognized governmental organizations in the air

quality domain. Consultation of RIVM has learned that

these environments are different. In the source

information systems raw values are archived as are all

calculated measurements that are recalculated based

on new parameters.

It is clear that the accessible data is a subset of the

data in the source information system: the data on

luchtmeetnet.nl have passed through extensive

quality control (except the current values, which are published “as-is” before validation unless they are flagged as erroneous directly after measurement). The amount of metadata linked to this validated data is large, and much of it requires domain knowledge to interpret.

43 https://www.luchtmeetnet.nl

44 https://api.luchtmeetnet.nl/open_api.

45 For instance Amsterdam: https://api.data.amsterdam.nl/dcatd/datasets/Jlag-G3UBN4sHA


Are technologies being used that guarantee the

integrity of the source information system or

repository: securing that the information at the right

moment has been “frozen” and therefore cannot be

changed anymore?

The data in the publishing repository cannot be

changed from the outside (by the users of the

platform). From the inside (by the organisations that

publish the data) it can:

Raw data is being processed and new validated data is

derived from that. Post processing is not done once

but several times based on new measured data

(based on the 4 steps by machine correction, human

verification and correction, monthly averaging and

correction and yearly averages and correction). This

post processing might lead to new measured values.

Old values will be deleted in the publication platform

and new values are published. Consultation of RIVM

has shown that the previous versions of information

objects are being archived in the source information

system, but these versions are not available as open

data.

Does the information system or repository

incorporate technology of preservation to guarantee

readability and usability in the future? Are the records

(metadata and data) available in a sustainable format

(5-star model of Tim Berners-Lee

(http://5stardata.info/en/): Open licence, not

structured or open format/ machine-readable

structured data / non-proprietary format/ open

standards from W3C / linked open data)

It is not clear whether the repository has preservation

technology. It is not clear for how long data is being

archived nor is it clear whether the platform will stay

available for the future.

Consultation of RIVM shows that all stakeholders

strive to be consistent over time concerning data

model, measuring methodologies, post processing and

publishing which should help to guarantee usability in

the future. Conversions to the publishing platform are

prepared and executed with the stakeholders to

guarantee readability in the future.

The data available on www.luchtmeetnet.nl is

provided under the licence CC BY-ND 4.0

A reflection on the use of the data and the relevance of recordkeeping principles for trust.

The official distribution mechanism is the platform that has been used for this research as

well, luchtmeetnet.nl, and the related API. No statistical data has been found on the (re)use of this information. The measurement information is used by government for verifying compliance with international air quality norms and monitoring trends in air quality. Therefore it is

important that the measurements are of high quality. There are mechanisms in place to

support the quality of the information presented. However, details are not given on the

publication platform, which leaves this processing as a black box for the user of the open data.

After consulting RIVM it has become clear that the data that has been made definitive (data of

2019 and older) can be used for legal purposes (for instance in juridical processes related to

emission levels of air components that do not comply with legal emission limits). The

measurements in the current year (2020) cannot be used as such, because they are still

preliminary, although validated, and might still change.


It is not clear who is (re)using the measurement data. RIVM indicates, after being consulted, that they do not know who is (re)using the information: the API and download functionality is freely available, without registration. Therefore it is not clear either for what purpose data is downloaded. The only indication they have are page counts of web URLs.

The final reflection is related to the principal question whether the open data has a level of

trustworthiness to be able to serve in legal contexts. This is the case for these air quality

measurements, because of the legal role of RIVM.

Detailed additional information provided by RIVM indicates that a large amount of metadata

is being acquired and registered to support the quality of the measurements and therefore the

trustworthiness of the information. However, this metadata is not provided in detail and

therefore verification is not possible. This undermines the social- and content based trust.

Therefore private initiatives have started measuring air quality, sound levels and other

relevant environmental values that are regulated by law to verify these official measurements.

It is clear as well that the complex mechanism of post-processing of data that is or might be

corrected implies that reports concerning the same information generated today may be different from the reports that were generated a year ago. Only transparency on all relevant

data and metadata can support the social based trust of RIVM and the content based trust of

the air quality measurements.

In this case it is clear that the lack of recordkeeping standards for the published part of the

data, i.e. standards that link the data to the context of their creation for proper interpretation, causes

knowledge gaps and therefore affects the trustworthiness.

5.4 Crowd sourced air quality data

How the data has been accessed for this research and based on what information

On the governmental platform data.overheid.nl data are published about air quality that are

not measured by recognized institutions or governmental bodies but that are crowd sourced.

For this research the raw air quality data captured by bike riders with air quality sensors

(sniffer bike) have been assessed46.

On several public sites information is being given about this initiative:

• the website of the initiative itself47

• the website of Civity, the organisation that manages the data that is being generated48

• the website of RIVM, which recently started to post-process the information49

The data is being stored, managed and published on the City Innovation Platform, hosted by

the Dutch organisation Civity. The data can be accessed by using a specific application or

46 https://data.overheid.nl/dataset/raw-snifferbike-snuffelfiets-data

47 https://www.snuffelfiets.nl

48 https://civity.nl/products-solutions/snuffelfiets/

49 https://www.rivm.nl/nieuws/500-fietsen-gaan-luchtkwaliteit-meten


dashboard50 or by downloading the data from data.overheid.nl. A definition of the data (data

dictionary) is available on the platform.51

Besides the publication of this raw data, the Dutch institute of public health RIVM recently

started to publish a post-processed version of this information. This information can be

accessed by downloading it from data.overheid.nl52.
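As a minimal sketch of working with a downloaded weekly export, the code below loads the CSV with pandas and computes a simple daily summary. The file name and the column names used are assumptions for illustration; the authoritative field names are listed in the data dictionary published on the platform and should be verified against the actual file.

```python
# Minimal sketch of loading a weekly sniffer bike CSV export with pandas.
# The file name and column names below are assumptions for illustration;
# the authoritative names are given in the published data dictionary.
import pandas as pd

df = pd.read_csv("snuffelfiets_week.csv")   # file downloaded via data.overheid.nl

print(df.shape)              # number of measurements and columns in this weekly file
print(df.columns.tolist())   # compare with the data dictionary

# Example: a daily mean of an assumed fine dust column, using an assumed
# timestamp column; the guard keeps the sketch safe if names differ.
if {"recording_time", "pm2_5"} <= set(df.columns):
    df["recording_time"] = pd.to_datetime(df["recording_time"])
    daily_pm25 = df.set_index("recording_time")["pm2_5"].resample("D").mean()
    print(daily_pm25.head())
```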

Description of the objective and content of the data

The sniffer bike initiative is an experiment to obtain more knowledge, data and information

about cycling, with the objective to improve the incorporation of cycling in mobility policies.

Until recently a lot of information was available about car mobility but a lot less about

mobility based on bicycles. The sniffer bike experiment starts from the idea that cyclists can contribute to solving several societal problems related to traffic congestion, air quality, public

health, participation and sense of happiness. New technologies cause changes in the mobility

ecosystem related to cycling, like the rise of the use of electric bicycles and the use of media

devices connected to bicycles. The province of Utrecht has the ambition to be the frontrunner

in acquiring new knowledge about these changes.

In this experiment the focus is on using bicycles and cyclists to execute measurements while

cycling. It is designed to acquire mobile data at large scale from citizens who volunteer to participate in the experiment. It is a public-private partnership in which the IoT company SODAQ (production of mobile sensors) and Civity (data management) also participate. Recently RIVM started to support the experiment by validating the acquired

data.

The project started about a year ago and is executed in different phases to be able to learn about and anticipate the uncertainties in this kind of innovative experiment (which depends on new technologies, human behaviour, weather conditions etc). The experiment started

with a small number of volunteers measuring a limited number of air components (those that

were topical for that moment, especially the level of fine dust in relation to air quality for

cyclists and public health). Based on the measurements, “green” (healthier) cycling routes can be identified for cyclists to choose from. Governments will have the basic information available to take measures to improve routes that are considered unhealthy.

After a period of testing in 2018 the activities started in 2019 with the participation of 500

cyclists. With the sensor made by SODAQ, real time data is being transmitted via

narrowband (later on there will be a switch to LTE-M) directly to the data platform of Civity

(see figure 5.6).

50 https://dashboard.dataplatform.nl/sodaq/v2/groene_fietsroutes.html

51 https://ckan.dataplatform.nl/dataset/8b04f4f3-666c-4448-91fc-234d5a75e6c4/resource/1d914417-3a80-

4120-9e2e-e4c918ee67a5/download/datawoordenboek_snuffelfiets.csv

52 https://data.overheid.nl/dataset/rivm-corrected-snifferbike-snuffelfiets-data


Figure 5.6 Data acquisition, processing and publication of sniffer bike data 53

Technically, Sniffer Bike54 is a mobile solution that is mounted on the bicycle handlebar. The published data comprises the raw data measured by “sniffer bikers”. Each week on Monday a file with the data collected in the previous week is published in this dataset. This data has not

been validated by the RIVM. To protect the privacy of cyclists the reference to the IMEI

number of sensors (the unique identification of the sensor) has been removed. The figure

below shows the visualisation of the open data within the application.

Figure 5.7 Visualisation of sniffer bike open data with specific web application

During this research project, post-processed data became available separately55. The post

processing is done by RIVM. The dataset contains the RIVM corrected measurement data

collected by the snifferbikes. Every week a collection is made of the data collected in the

previous week and published there. However, verifying the content of these data (date of verification: 14th of June 2020) resulted in empty datasets.

53 https://civity.nl/products-solutions/snuffelfiets

54 https://civity.nl/en/sniffer-bike/

55 https://ckan.dataplatform.nl/dataset/rivm-corrected-snifferbike-snuffelfiets-data


The province of Utrecht is financing the sensors to be used by the volunteers and Civity

provides for the data management part. Besides the measurements themselves, the experiment

has the additional objective to define what the minimum number of measuring cyclists has to

be to obtain viable results and what conclusions can be drawn from these results. The

experiment is ongoing in 2020 and will end in December 2020. On the 2nd of June 2020 a

webinar was organised to present the results obtained until June 202056. In essence three

conclusions were drawn:

• measuring fine dust (making use of the Sensirion SPS30 sensor) works well for

PM2.5 but doesn’t work for PM10

• the sensors need frequent calibration

• whether there are real differences in air quality between routes cannot be concluded

yet (one of the reasons is that on many days there are too few measurements to be able

to make significant statements for the whole of the province Utrecht)

Some additional information has been solicited from Civity about how metadata is provided

and how data is managed within the CIP platform. Civity has indicated that the metadata related to the content of the data is provided by the owner (in this case the volunteers that measure the data, by means of metadata automatically generated by the

sensor); the metadata related to the publication of the data (date of publication, date of

updating, number of data sources etc.) are being generated and/or updated automatically by

the publication platform. Before publishing the measured data, simple post processing takes

place that consists of the conversion of the raw data from the sensor to corresponding

interpretable values of the measured air components. Additionally, data that might infringe privacy regulations is deselected. The CIP platform keeps historical data; however, the

information has not been locked or “frozen”. It can be changed from the outside and/or the

inside. The repository contains the raw data to prevent any loss of information if conversions

of the repository are needed. Changes in technologies and/or methodologies therefore should

not have any effect on the readability and usability of the data.

Assessment of the presence of recordkeeping principles

Based on the available documentation as mentioned in the description of the dataset and an

analysis of the available open data itself, the presence and/or application of recordkeeping

principles has been assessed using the framework elaborated for this research.

56 https://snuffelfiets.nl/wp-content/uploads/2020/06/Webinar-2-juni-2020.pdf


Table 5.3 Assessment form for crowd sourced air quality data

Minimal available descriptive metadata: can the juridical and administrative context (i.e. actors) and the procedural context (i.e. processes, use of standards) be derived?

Is the format of the data described? The format of the data is described and explained: for

every measurement the following is being emitted

and stored: IMEI (identifier of the sensor), unique identifier of the measurement, trip-sequence, values for air components (PM10, PM2.5, PM1.0, Volatile

Organic Compounds), atmospheric condition

(Temperature, Humidity, Barometric pressure), GPS

location (latitude & longitude), Date and time,

Accelerometer (indicates irregularities in the road)

Provenance General: can the identity of the creator

be identified and verified?

The identity of the creator cannot be identified

because the unique identification of the sensor that

generates the information is removed for compliance

with GDPR (although the publication of the

geographic location in combination with time-stamps

might make unique identification of persons possible57)

Provenance Specific: can the process / methods be

identified that have been used to create or derive the

data?

The process / methods can partially be identified: a

description is given of how data is captured, transmitted and stored within the environment from which it is also published58; Civity has provided some general

information about the post-processing of data but not

all details were given.

Are the data primary (no selection and no processing

(except processes of dissociation of personal data)

and no copying (accessing the source))?

The data can be considered primary although some

minimal processing takes place: original data is being

stored in the sensor and copied/transmitted to the

CIP platform. It is stated that the data are raw data,

but identification of the sensor and the information of

the starting part of the route are removed for privacy reasons. Raw values are converted to

interpretable values of air components

Topic: what does the data represent; do the data have

a defined taxonomy or ontology?

The taxonomy is described in the available explanation of how to interpret the sniffer bike data59.

57 The location and time-stamp of every measurement can be combined with for instance available mobile

phone data (which records location and time-stamps as well and might have the identity of the user

connected to it)

58 https://civity.nl/products-solutions/snuffelfiets/

59 https://ckan.dataplatform.nl/dataset/8b04f4f3-666c-4448-91fc-234d5a75e6c4/resource/1d914417-3a80-

4120-9e2e-e4c918ee67a5/download/datawoordenboek_snuffelfiets.csv


Purpose: for what purpose has it been created? The purpose is described in the general description of

the dataset that can be found in the data.overheid.nl

environment60. Its objective is to provide insight into

air quality, cycling routes and urban heat islands.

Location: to what location is the data related? All individual measurements are related to a specific

location of the mobile sensor at a specific moment in time, by registering the geoposition based on GPS data.

Time: To what time frame is the data related? What’s

the actuality of the data and metadata; does historical

data remain online with appropriate version-tracking

and archiving; is continuous availability guaranteed by

a legal or policy commitment?

All individual measurements are related to one specific moment in time.

Quality and uncertainty: is a description of

correctness, completeness, and precision/granularity

present?

There is no description of correctness, completeness

or precision/granularity within the metadata. This

depends on the quality of the sensors and eventual

pre- and post-processing procedures. The sensors

which have been used can be identified, but

calibration reports do not seem to be available. Post

processing procedures are not described in detail to

derive quality and uncertainty. However, the

validation mechanisms of RIVM can support an

assessment on quality and uncertainty. They have

been communicated in the webinar of 2nd of June

2020.

What other non-domain specific metadata are

available that supports trust components?

There are no other specific metadata identified that

support trust components.

Are the metadata comprehensive; do the metadata comply with metadata standards (international standards, domain standards)?

The metadata of the CIP platform are based on the DCAT standard; therefore this information is automatically available in the national registry of open data (data.overheid.nl) and the EU portal. The metadata within the data itself are domain specific for this registry. They are automatically generated and can therefore be considered metadata by design.

Repository: accessibility, integrity and preservation

Are the data findable and accessible: how can the data be found (portal, search engine), how can the data be accessed (API, linked data/URI, (download) portal), which commonly owned or open formats (including machine-readable) are used?

The data can be found using data.overheid.nl and is accessible from the CIP platform. The data can be accessed using a specific app, services and APIs, and can be downloaded as well.

60 https://data.overheid.nl/dataset/raw-snifferbike-snuffelfiets-data


Is the data accessible from its source information system or a recognized repository for publishing?

The data is directly accessed from its source information system, the City Innovation Platform, assuming that the individual sensor is not considered a source information system.

Are technologies being used that guarantee the integrity of the source information system or repository, securing that the information at the right moment has been "frozen" and therefore cannot be changed anymore?

The users of the City Innovation Platform are supposed to have different levels of authorisation that permit them to add, change and publish information. Civity indicates that information can be changed after it has been published as open data.

Does the information system or repository incorporate preservation technology to guarantee readability and usability in the future? Are the records (metadata and data) available in a sustainable format (the 5-star model of Tim Berners-Lee (http://5stardata.info/en/): open licence, not necessarily structured or in an open format / machine-readable structured data / non-proprietary format / open standards from W3C / linked open data)?

It is not clear whether the repository has preservation technology. It is not clear for how long data is being archived, nor is it clear whether the platform will stay available in the future. The data is provided under the licence CC-0 (1.0). The data can be downloaded in CSV format.
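As noted in the assessment above, the descriptive metadata of the CIP platform are based on the DCAT standard and exposed through a CKAN catalogue. A minimal sketch in Python of how a re-user could inspect these descriptive metadata programmatically is given below; it assumes that the standard CKAN action API is enabled on the platform and that the dataset identifier taken from footnote 59 resolves through that API (both are assumptions, not verified here).

import requests

# Sketch: retrieve the descriptive (DCAT-derived) metadata of the sniffer bike
# dataset from the CKAN catalogue of the data platform (see footnote 59).
BASE_URL = "https://ckan.dataplatform.nl"
DATASET_ID = "8b04f4f3-666c-4448-91fc-234d5a75e6c4"  # dataset id taken from footnote 59

response = requests.get(
    f"{BASE_URL}/api/3/action/package_show",
    params={"id": DATASET_ID},
    timeout=30,
)
response.raise_for_status()
package = response.json()["result"]

# A few descriptive metadata fields that are relevant for a trust assessment.
print("title:        ", package.get("title"))
print("organisation: ", (package.get("organization") or {}).get("title"))
print("licence:      ", package.get("license_title"))
print("last modified:", package.get("metadata_modified"))
for resource in package.get("resources", []):
    print("resource:", resource.get("name"), "|", resource.get("format"), "|", resource.get("url"))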

A reflection on the use of the data and the relevance of recordkeeping principles for trust.

The first reflection is related to the principal question whether the open data has a sufficient level of trustworthiness to be able to serve in legal contexts. This seems not to be the case for the sniffer bike data, although the data are post-processed by RIVM. Although the questions related to this were not answered by CIP, it can be derived from the context of the pilot project that, given the quality of the sensors, they can be considered helpful for additional measuring at a higher density than the official measurement stations. At this moment in time this mainly helps to define or detail policies, to allow for additional input in air quality models and to support scientific research on the used technologies and methodologies.

Statistical information about the (re)use of the data could not be found. Based on the webinar held on the 2nd of June it can be concluded that the data is at least being (re)used by the stakeholders of the experimental project itself. Civity provided some additional information indicating that it is counted how many times the CKAN webpage has been consulted, but that is not related to consulting and/or downloading the data itself. The same holds for the use of the API: it does not monitor the (re)use of data.

The cyclists (re)use the information to assess its quality and raise questions about its usability; the transparency about the process from creation to publication helps to assess the usability of the information (sensor characteristics and processing determine content, quality and trustworthiness), but has to be combined with specific domain knowledge about air quality (measurements). They might also use the information to decide whether they want to cycle at a specific moment, considering the levels of fine dust, and what routes they choose.


The participating provinces reuse the information for compliance with their program to improve air quality (measurement with sensors by citizens is part of the program "Uitvoeringsprogramma Schone Lucht" of the province of Utrecht61). They want to use the information to develop green(er) bicycle routes, but also in a more general way to develop policies for the design of the living environment and to increase the involvement of citizens with these policies. Transparency and trustworthiness are key for this involvement.

The raw sniffer bike data is being (re)used by the RIVM to produce new, validated information. It is not clear exactly what corrections are being made, but it is assumed that the correction is based on comparison with other reference measurements, which should lead to higher trust in the validated data. After validation and corresponding correction these values are used by RIVM as an addition to the official measurements done within their own network of stations in Utrecht (six stations) to support scientific research to determine the variation of air quality depending on geography (city, village, rural area), to determine and fine-tune calculation models and to provide additional information for the Provincial Report about air quality62. The validation by RIVM is an important step in improving the trustworthiness of the information related to content-based trust.
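Purely as an illustration of what such comparison-based correction could look like (this is not RIVM's actual validation procedure, which is not publicly documented in detail), a simple linear calibration of sensor readings against a co-located reference station can be sketched in Python; all numbers are fictitious.

import numpy as np

# Fictitious paired observations: low-cost sensor versus co-located reference station (µg/m3).
sensor_pm25 = np.array([12.0, 18.5, 25.0, 31.2, 40.1])
reference_pm25 = np.array([10.5, 15.8, 21.0, 26.4, 33.9])

# Least-squares fit of: reference ≈ slope * sensor + intercept.
slope, intercept = np.polyfit(sensor_pm25, reference_pm25, deg=1)

def correct(raw_value: float) -> float:
    """Apply the fitted linear correction to a raw sensor value."""
    return slope * raw_value + intercept

print(f"slope={slope:.2f}, intercept={intercept:.2f}")
print("corrected value for a raw reading of 28.0:", round(correct(28.0), 1))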

5.5 Analysis of the research data

In the previous three sections several data sources have been assessed on their incorporation of recordkeeping principles. The three data sources were chosen based on their difference in legal context:

1. a data source that is created by governmental bodies and is maintained and published based on legislation on the registry itself

2. a data source that is created, maintained and published by a recognized governmental organisation, where the creation of the dataset is not based on legislation, although part of the content does have to comply with European directives

3. a data source that is created by private persons and published by a private party on a platform that is being supported and used by governmental institutions63

In general the assessment of the three datasets shows that descriptive metadata vary and lack metadata that support specific trust components. For instance, there is a lack of metadata that indicate the uncertainty of data (which is related to content based trust). Domain (meta)data models vary and are not always available. Data collection and data processing standards vary

61 This program is a co-creation between the province of Utrecht and the stakeholders (citizens, NMU, the Dutch lung fund, EBU, research institutes and municipalities).

62 https://www.provincie-utrecht.nl/sites/default/files/2020-

03/rapportage_luchtkwaliteit_provincie_utrecht_2018.pdf

63 During the research project an alternative, post-processed version of this information also became available. Post-processing is done by the Dutch institute for public health.


and are not always available. With respect to the repositories it can be concluded that two of

the three repositories seem to lack guarantees to secure integrity and accessibility over time.

Below a classification is given of the assessment results for these three data sources regarding the level of incorporation of the recordkeeping principles as defined in the assessment form. Each recordkeeping principle is classified based on the following criteria:

• Green: the dataset and its accompanying information comply fully with the

recordkeeping principle

• Orange: the dataset and its accompanying information comply partly with the

recordkeeping principle

• Red: the dataset and its accompanying information do not comply with the

recordkeeping principle

• Grey: the level of compliance could not be confirmed because of lack of information

The classification is based primarily on the publicly available information and secondarily on

the information provided by Kadaster, RIVM and Civity.

Table 5.4 Classification of presence of recordkeeping principles in 3 assessed open data sets

(Columns: Recordkeeping Principle | Legislation: BAG | Recognized bodies but no legislation: air quality measurements | Private parties: air quality measurements. The colour classification of the individual cells is not reproduced in this text version.)

Format described
Known creator
Known process
Primary data
Known taxonomy
Known purpose
Known location
Known time
Known uncertainty
Additional trust metadata
Compliance metadata standards
Findable and accessible
Source information system or recognized publishing platform
Integrity guarantee
Preservation guarantee


The overview of the incorporation of recordkeeping principles shows that the BAG registry applies the majority of the recordkeeping principles. The formal air quality measurements of RIVM and its partners and the informal sensor data of private parties in general apply fewer of the principles, especially with regard to the trust components concerning available metadata and the uncertainty about the incorporation of the repository-related principles. In general this seems to be an indication that legislation might help to define and secure (by design) a level of trustworthiness based on the application of recordkeeping principles.

Considering the available general metadata:

• All datasets are described generally with relevant metadata, mostly incorporated in explanations on the platforms where the data can be found and in related additional documentation. Two of the three datasets are described in a standardized way in data.overheid.nl. Considering the metadata of individual information objects, standardized metadata are lacking. Depending on the data models, some attributes are incorporated that can be considered metadata (for instance location and time), but they are registry and/or domain dependent. It would be of great interest to investigate whether a standardized minimum set of metadata for each individual information object can be implemented to support the trustworthiness of reuse of individual information objects. This is elaborated in the next chapter.

Considering some specific metadata:

• Metadata that have the objective to give detailed information about the provenance should indicate clearly who is the creator of the information; as is shown in this research, in the case of the private air quality measurements, publication of this information as open data can be in conflict with European privacy regulation. However, the transparency of the procedures, the direct transmission of data from the source to the processing and publishing platform, the automatic generation of metadata, and the possibility of saving this information (for verification) within the sensor itself support the (derived) trustworthiness of the information.

• Primary data: the provenance metadata are directly related to the recordkeeping principle of primary data and to access to the source information system or a recognized publishing platform. This principle of primary data minimizes the risk that changes in the metadata about provenance, or in the data itself, are made between creation and publication, which would affect the trustworthiness of the information. Primary data can be supported by direct access to the source information system.

• Metadata that provide information about the processes that generated the information have proven to be very important for creating a combination of social based trust (based on the transparency of an organisation in publishing all process details) and content based trust (verification of (the quality of) the processing and the data based on the available metadata).


• Known uncertainty and additional trust metadata: in all three datasets detailed information to assess the quality of the information is lacking; some information is publicly available and some information was given in more detail, but a complete, detailed description of processes, parameters, algorithms and/or human decisions has not been found or provided. This definitely affects the trustworthiness of the information; on the other hand this is compensated in all three cases by several protocols to give feedback on the quality of the content and/or the possibility of public/private co-creation of data.

For all of these metadata reflections, a disclaimer is very important: none of the metadata is inseparably connected to the individual information objects (with the exception of the metadata that is incorporated at the level of each individual information object because of its domain data model). This means that data can be accessed and reused without taking into account its original context and significance. This allows for inappropriate reuse, whether involuntary or voluntary. This possible inappropriate reuse affects the trustworthiness of the reused information. When metadata are inseparable from the information object, misinterpretation and manipulation can be countered or perhaps even prevented.

Considering the recordkeeping principles of the repository:

• Integrity guarantee: the integrity of the data is difficult to determine. When using apps and downloads it is impossible to guarantee that the source data is being served without changes. APIs and services can help to secure this, provided that a complete data model of all available information within the publishing platform is available. Even then it is still possible that the information is not the same as in the source database, nor can it be guaranteed that the information cannot be changed. The trustworthiness can be improved by diminishing the "distance" between the source information system and the repositories for the (re)use of the open data, the ideal situation being direct access to the information at the source without copying (parts of) the information. This can be facilitated by making use of the principles of linked data and by incorporating additional functionality in source databases. This is elaborated in the next chapter (a sketch of a complementary fixity technique follows after this list).

• Preservation guarantee: two of the three repository environments (both related to the air quality measurements) gave no indication that preservation of information is guaranteed. Mechanisms for keeping historical information exist, but the legibility and usability of this information in the future is not guaranteed explicitly. In the case of the legislated BAG registry, however, this is obligatory. In all three cases the organisations strive to maintain legibility and usability in the future by updating the repository environments together with the stakeholders.
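One common recordkeeping technique that can make such integrity verifiable for re-users, not observed on the assessed platforms and therefore only a sketch of what could be done, is publishing a cryptographic hash (fixity value) of a dataset at the moment of publication, so that later changes to a downloaded copy can be detected.

import hashlib
from pathlib import Path

def sha256_of_file(path: Path) -> str:
    """Compute the SHA-256 fixity value of a downloaded dataset file."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Hypothetical usage: the publisher publishes the expected hash together with the
# dataset; the re-user recomputes the hash after downloading and compares the two.
downloaded = Path("dataset_download.csv")   # hypothetical local file name
published_hash = "..."                      # fixity value as published by the data provider
if downloaded.exists():
    if sha256_of_file(downloaded) == published_hash:
        print("integrity confirmed: file matches the published version")
    else:
        print("warning: file differs from the published version")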

These findings on compliance with recordkeeping principles are directly related to the assessment form which has been elaborated and used for this research. During the application of the form several flaws were encountered. These will be briefly described in the next section.


5.6 Assessment of the assessment framework

This research has been an attempt to assess current open data sources within a governmental context. The assessment has been executed with an assessment form based on literature. The assessment form has served to analyse the open data in a structured way and to look at the implementation of recordkeeping principles. However, it has been difficult to assess all the aspects in sufficient depth. This is mainly due to the limited information available online. In the context of this research the different institutions (creators, processors and/or publishers) were initially not contacted, to guarantee that the assessment of trustworthiness would be based on information that is publicly available to the user. However, consultation was expected to give additional information valuable for the research, and therefore Kadaster, RIVM and Civity have been consulted with some questions related to the assessment form. This has definitely resulted in valuable additional information.

It has been possible to analyse the different data sources and data platforms by reviewing the details that were available online (legislation, manuals, services, APIs and downloaded data). It has also been possible to derive conclusions about whether the recordkeeping principles are present or not and to what level. It turned out not to be practical to have a subdivision at dataset level, record level and attribute level: documentation mostly applied to the dataset / data model as a whole, and metadata at lower levels were domain dependent.

Two specific aspects were not present in the assessment form but turned out to be very relevant for the objective of the thesis on the trustworthiness of information:

• can the available information be (re)used to serve as evidence (to serve in legal

contexts), which is an indication of the trustworthiness of the information

• what can be said about current (re)use of the different datasets which also indicates a

level of trustworthiness

Both aspects can and should be added to the assessment form to complete the assessment

from a user-oriented point of view.

The findings of this research have been translated into recommendations that can help to promote the trust of open data environments and the preceding data governance environments.

These are described in the next chapter.


6. Recommendations to promote trust of open data environments

6.1 Translation of the research findings to recommendations

This research started from two basic assumptions:

• recordkeeping metadata can support the accountability of an organization and support

the traceability, authenticity and sustainability of records; it provides information

about the context of the creation of records needed for proper interpretation and

(re)use

• the record continuum model reflects the (continuous and limitless) reuse of records;

the digital age technology enables and supports this reuse of records; therefore a focus

is needed on the creation, capture, and maintenance of records in a way that

sustainable accessibility is guaranteed. This implies the use of repositories that

guarantee integrity and sustainable accessibility

This research has shown that the availability of recordkeeping metadata can help (re)users of

open data to assess the usability of the open data for their goals. In what context was the data

created, by whom, which processing has taken place and what is finally published. The research has shown that metadata is mostly available at dataset level. However, not all metadata is publicly available. Details about the processes that are used to create and/or process the data are mostly not available. Most metadata is not available in a form that supports

automated (re)use of data in which the metadata can be taken into account. Some of the

metadata is available for individual information objects as well, but this mostly concerns

domain-related metadata. Metadata are not inseparably connected to each individual

information object. This means that data can be accessed and (re)used without taking into

account its original context and significance. This allows for inappropriate reuse, involuntary

or voluntary. This possible inappropriate reuse affects the trustworthiness of the (re)used

information. Metadata that are inseparable from the information object can prevent

misinterpretation and manipulation. To support the trustworthiness of open data it is

recommended to investigate whether a standardized minimum set of inseparable metadata for

each individual information object can be implemented to support the trustworthiness of

(automated) reuse of individual information objects. This is elaborated in section 6.2.

This research found that the sustainability of two of the repositories is not guaranteed. This

research has shown as well that the characteristics of the repository, the processes to publish

information in the repository and the way this information can be accessed determine the

trustworthiness of data as well. All of these aspects affect whether integrity of data can be

guaranteed. Is the source data being served without changes? When it is served without

changes, who can guarantee, if information objects start to "wander", that they stay unchanged and will be used correctly? All repositories contain copies of the source data and all of them allow for reuse of information by copying it, thereby separating it from the repository. The trustworthiness of records can also be improved by diminishing the "distance" between the source information system and the repositories for the (re)use of the open data, the ideal situation being direct access to the information at the source without


copying (parts of) the information. This can be done for instance by making use of the

principles of linked data for access to records and incorporating additional functionality in

source databases to guarantee integrity and preservation. This is elaborated in section 6.3.

6.2 Standardized and inseparable minimum metadata set for individual

information objects

The research findings indicate that trustworthiness of open data is affected by the lack of

inseparable (standardized) metadata at the level of individual information objects. A

standardized minimum set of inseparable and unchangeable metadata for each individual information object can help to strengthen this trustworthiness.

If by-design principles are applied, a standardized set of metadata could be captured for every individual information object when creating and/or changing information objects. The standardized set could consist of different groups of metadata that support the trustworthiness of the data and the sensemaking of the data (a sketch of such a set follows the list below):

• a group of metadata that identifies the information object uniquely: a unique identifier,

the creator, type of information object or classification (based on taxonomy), the

creation date for the information object, the date of fixation (after which changes are no longer possible), the period for which the data is valid, etc.

• a group of metadata that helps to comply with legislation: registration of retention

times, confidentiality of the information object, purpose of creation to determine

allowed (re)use (in the case of open data retention time has only significance when the

data is not provided physically but by means of linked data, confidentiality is only

relevant in the source information system to determine whether the information can be

published / linked to)

• a group of metadata that helps to detail the context of creation and the related quality

of the information object: information about the process that created and/or changed

the information object, used methodologies, information about quality and uncertainty,

quality control and corrections

• a group of metadata that specifically supports further sensemaking (apart from the

metadata that are already mentioned that help support sensemaking like creator, type /

classification, purpose, context of creation, methodologies) and might be more domain

dependent: format, topic, taxonomy, location, time and other metadata to support the

sensemaking of data.
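A minimal sketch of what such a record-level minimum set could look like, expressed in Python as an unchangeable (frozen) data structure; all field names are hypothetical illustrations of the four groups listed above, not an existing standard.

from dataclasses import dataclass, field
from datetime import date, datetime
from typing import Optional

@dataclass(frozen=True)  # frozen: the metadata cannot be changed after creation
class InformationObjectMetadata:
    """Hypothetical minimum metadata set for one individual information object."""
    # Group 1: unique identification
    identifier: str                            # unique identifier
    creator: str                               # creating organisation, person or device
    object_type: str                           # type / classification based on a taxonomy
    created: datetime                          # creation date and time
    fixated: Optional[datetime] = None         # date of fixation, after which changes are not allowed
    valid_until: Optional[date] = None         # end of the period for which the data is valid
    # Group 2: compliance with legislation
    retention_until: Optional[date] = None     # registered retention time
    confidentiality: str = "public"            # confidentiality classification
    purpose: str = ""                          # purpose of creation, determines allowed (re)use
    # Group 3: context of creation and related quality
    process: str = ""                          # process / methodology that created or changed the object
    uncertainty: Optional[str] = None          # description of quality and uncertainty
    # Group 4: domain-dependent sensemaking metadata
    domain_metadata: dict = field(default_factory=dict)   # e.g. format, topic, location, time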

This standardized minimum set of inseparable and unchangeable metadata for each individual

information object could be taken into account in the design of every new information

system. It can be based on and integrated with the new metadata standard under development

for sustainable accessibility of governmental information objects (MDTO).

The standardization of those metadata that are domain independent supports the interoperability of data in integral analyses based on that data, as well as the automated execution of processes like retention, publication, verification of allowed (re)use and similar processes (a small usage sketch follows below).
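As a small usage sketch of such automated processing, building on the hypothetical metadata set above, a publication check could combine confidentiality and retention in a single, metadata-driven rule.

from datetime import date

def may_be_published(meta: "InformationObjectMetadata", today: date) -> bool:
    """Automated check using the metadata sketched above: only publish public
    information objects that are still within their retention period."""
    within_retention = meta.retention_until is None or today <= meta.retention_until
    return meta.confidentiality == "public" and within_retention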


The standardization per domain of domain-dependent metadata will support the sensemaking of the data, help to verify the (quality of the) content and enable automatic processing and analysis (including the use of artificial intelligence). This supports the trustworthiness of open data (strengthening both social based and content based trust).

6.3 Source information system as archiving and publication platform

The research findings indicate that if information objects are not managed and maintained in their original source information system or a recognized and managed publication environment, and copying is allowed, they might start to "wander". When that happens (and that is the case with the majority of open data) it is impossible to guarantee that they stay unchanged and will be used correctly. Nor is it possible to guarantee compliance with legislation regarding retention times, privacy regulation and purpose-related (re)use of records.

All three investigated datasets in this research concern open data that do not reside in the information system where they were created. This may have been done for several reasons. Separation of the data that will be published as open data from the other, private data might be needed to provide the open data in an environment that is specifically made for the use of open data. Separation might also be needed because of different governance rules between the source information system and the publication information system. The downside of this separation is the possibility that the information is not the same as in the source database (because of changes made during the copying of information, or because of less regulation in the publication environment that allows for manipulation of the data).

The trustworthiness of open data can be improved by diminishing the "distance" between the source information system and the information system used for the (re)use of the open data, the ideal situation being direct access to the information at the source without extracting or copying the information. This can be done by making use of services and APIs that directly access the source database. It can also be supported by making use of the principles of linked data. At the same time it is relevant to incorporate functionality in the source information system that guarantees the integrity and preservation of information. This would lead to an information infrastructure in which all data is managed in its source information system during its complete lifecycle, from creation to archiving or deletion. Figure 6.1 shows a possible architecture with management of metadata by design and source repositories that have recordkeeping functionality, from which data can be reused based on linked data principles.


Figure 6.1 Possible information architecture with integral recordkeeping functionality

This concept of publishing information from its source information system (by means of, for instance, linked data) helps to better protect the information, to better manage compliance with legal requirements (in combination with metadata related to the GDPR, archival law, confidentiality of data, open data legislation and "objective of use" regulations) and to prevent physical "wandering" of information. However, current technical limitations of information systems impede low-level reuse of large amounts of data that are not copied to the processing platform (consultation, referencing or processing of individual information objects is possible, but doing this for complete datasets with millions of information objects is more difficult, as is being done within data science environments for instance). Therefore this is a model to grow towards in the coming years.
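As an indication of how such direct, source-based access could look for a re-user, a sketch is given below; the URI, the availability of an RDF representation and the content negotiation behaviour are assumptions about a possible linked data set-up, not a description of an existing endpoint.

import requests

# Hypothetical persistent URI of one individual information object in a source
# repository that supports linked data (the URI is illustrative only).
OBJECT_URI = "https://example.org/id/information-object/12345"

# Content negotiation: ask the source system for a machine-readable RDF representation,
# so the object is consulted at its source instead of being copied around.
response = requests.get(OBJECT_URI, headers={"Accept": "text/turtle"}, timeout=30)
response.raise_for_status()

print(response.headers.get("Content-Type"))
print(response.text)  # the RDF (Turtle) description, including its inseparable metadata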

The "architectural" approach to the design of data models, information processing, information systems and architectures allows for the incorporation of recordkeeping principles that strengthen the trustworthiness of information. Both standardized, inseparable and unchangeable metadata and source information systems with integrity and preservation guarantees help to reach these goals. This will also be stressed in the concluding section of this research.


7. Conclusions

The concept of trust has been the key topic in this research. The focus of this research has

been the data that governments generate and open up for reuse. Government bodies are

responsible for the execution of a lot of tasks where trustworthy information is key. For the

execution itself, for the accountability of the execution and for the reuse of information,

information is created, maintained and published that is considered trustworthy.

That Dutch government bodies are considered to be trustworthy and therefore supposedly

produce accurate and reliable information is part of our social culture and related to the

concept of social trust. However, opening up government data in combination with new

individual ways of creating and/or verifying data allows for assessing this social based trust by assessing the quality of the content of the data. The resulting content based trust can support already existing social based trust, or it can undermine it when the quality is poor or when the quality cannot be assessed because of a lack of transparency. This has been the case in several instances in the past years, from environmental measurements by government bodies to fraud

profiling and taxation issues. Gillian (2018) concluded that there are many socio-cognitive

factors in play and a reliable source alone does not consistently determine trustworthiness for

users. Yoon (2014) concludes on the trust of data in repositories that “trust in data itself plays

a distinctive and important role for users to reuse data, which may or may not be related to the

trust in repositories".

The objective of this research was to find whether open government data can be trusted based

on recordkeeping principles that support trustworthiness. The research has shown that

available open data do incorporate recordkeeping principles but at different levels and in

different ways. The current Dutch open data landscape is diverse. Governments are opening

up, but do not have everything in place yet to support trustworthiness of information. The

assessment of three datasets has shown that descriptive metadata vary and lack metadata that

support specific trust components, that domain (meta) data models vary and are not always

available, that data collection and data processing standards vary and are not always available,

and that (source) repositories seem to lack guarantees to secure integrity and accessibility over

time. The availability of metadata and adequate repositories that support the trustworthiness and sensemaking of data can be strengthened.

To strengthen the social- and content based trust, this research has recommended the further

incorporation of recordkeeping principles by creating standards for descriptive metadata at the

level of individual information objects that are valid during the complete life cycle of these

information objects and can be assessed by the re-users of information. At the same time the information can be protected, while remaining transparent, by promoting that the data reside in (source) repositories that guarantee integrity, accessibility and usability over time.

It has to be taken into account continuously that data are not neutral: the design of sensors, the

modelling of data models, the classification of information, the design of algorithms, the

results of analysis, all are human activities that start from an interpreted and/or subjective

theory or design that influences the way the data can be used and interpreted.


The strengthening of social and content based trust, by incorporating the recordkeeping principles in the design of data models, data processing and information systems, will further foster confidence and trust. Co-creation and participation of creators and (re)users of data

help to establish an environment in which data really fulfils its function to support the daily

processes of policy making, policy application and the execution of a lot of governmental

processes.

What has not been incorporated in this research, and could be a topic for further research, is to

investigate whether the absence of the identified recordkeeping principles, or the inability to prove their presence, really influences the use or non-use of the information. An attempt to reflect on this has been incorporated in the assessment of the three datasets and reveals the necessity and relevance of investigating this (considering, for instance, that the (re)use of the air quality data was not known).

Trust and authoritativeness in the digital world have to be created by combining a lot of different key components, from having trustworthy digital identities, to authentication mechanisms to verify those identities, to rights attached to information objects to be created, updated or read by those identities. To be able to guarantee the robustness of an information ecosystem in relation to trust and authoritativeness, this ecosystem has to be created by design, incorporating these key components. This is already difficult for government bodies, and seems to be almost impossible for the diverse landscape of information systems in the connected world.


Bibliography

Acker, Amelia (2017), "When is a record?", in Research in the Archival Multiverse, edited by Anne J. Gilliland, Sue McKemmish and Andrew J. Lau. Monash University Publishing, p. 288-323.

Bearman, David (1993), "Record-Keeping Systems", in Archivaria, 36 (1993), p. 16-36.

Borglund, E. and Engvall, T. (2014), "Open data?: Data, information, document or record?", Records Management Journal, Vol. 24 No. 2, pp. 163-180.

Borgman, Christine L. (2015), Big Data, Little Data, No Data: Scholarship in the Networked World. Cambridge, MA: MIT Press, 2015.

Boyd, Danah & Kate Crawford (2012), Critical Questions for Big Data, Information, Communication & Society, 15:5, 662-679.

Casellas, L.E., Oliveras, S. and Reixach, M. (2012), "The authenticity of Data-Centric Systems at the Girona City Council", InterPARES 3 Project, Catalonia TEAM Case Study. Available at: www.interpares.org/ip3/display_file.cfm?docip3_catalonia_cs02_final_report_EN.pdf

Ceolin, D. et al. (2016), Combining User Reputation and Provenance Analysis for Trust Assessment, in Journal of Data and Information Quality (JDIQ), 06 June 2016, Vol. 7(1-2), pp. 1-28.

Donaldson, D.R. & Conway, P. (2015), User conceptions of trustworthiness for digital archival documents, in Journal of the Association for Information Science and Technology, December 2015, Vol. 66(12), pp. 2427-2444.

Duranti, L. (2018), 'Whose truth? Records and archives as evidence in the era of post-truth and disinformation', in Brown, Caroline (ed.) Archival Futures. London: Facet Publishing, 2018.

Faniel, I.M., Frank, R.D. and Yakel, E. (2019), Context from the data reuser's point of view, Journal of Documentation, 26 September 2019, Vol. 75(6), pp. 1274-1297.

Floridi, L. (2011), The philosophy of information. Oxford University Press, reprinted 2014.

Gillian, O. (2018), AA02 User perspectives of trust, InterPARES Trust Project. Available at: https://interparestrust.org/assets/public/dissemination/AA02FinalReport.pdf

Gilliland, A.J. (2016), Setting the stage, in Introduction to metadata: Pathways to digital information. Third Edition. Murtha Baca ed. Available at: http://www.getty.edu/publications/intrometadata/setting-the-stage

Hurley, Grant, Valerie Léveillé and John McDonald (2016), Managing Records of Citizen Engagement Initiatives: A Primer. InterPARES Trust Project, 2016.

Heersink, H., Bohmer, R. and Giesen, S. (2017), "Verkenning raakvlakken GDI in relatie tot digitale archivering en duurzame toegankelijkheid", Eindrapport ICTU C867.


International Organization for Standardization (ISO) (2001), TC 46/SC 11. ISO 15489-1:2001 Information and Documentation. Records Management. Part 1: General, 1st ed., International Organization for Standardization, Geneva.

Jaakkola, H., Makinen, T. and Etelaaho, A. (2014), Open data: opportunities and challenges. Paper presented at the Proceedings of the 15th International Conference on Computer Systems and Technologies, Ruse, Bulgaria (2014).

Janssen, M., Charalabidis, Y. and Zuiderwijk, A. (2012), "Benefits, adoption barriers and myths of open data and open government", Information Systems Management, Vol. 29 No. 4, pp. 258-268.

Kelton, K., Fleischmann, K.R. and Wallace, W.A. (2008), Trust in digital information. Journal of the American Society for Information Science and Technology, 59(3), 363-374.

Koesten, L., Gregory, K., Groth, P. and Simperl, E. (2020), "Talking datasets: Understanding data sensemaking behaviours". Cornell University (2020). Available at: https://arxiv.org/abs/1911.09041

Kool, L., J. Timmer, L. Royakkers and R. van Est (2017), Opwaarderen – Borgen van publieke waarden in de digitale samenleving. Den Haag: Rathenau Instituut.

Lemieux, V.L., Gormly, B. and Rowledge, L. (2014), "Meeting Big Data challenges with visual analytics: The role of records management", Records Management Journal, Vol. 24 Issue 2, pp. 122-141. Available at: https://doi.org/10.1108/RMJ-01-2014-0009

Lemieux, V.L. (2017), "A Typology of Blockchain Recordkeeping Solutions and Some Reflections on their Implications for the Future of Archival Preservation". Available at: http://dcicblog.umd.edu/cas/wp-content/uploads/sites/13/2017/06/Lemieux.pdf

Léveillé, V. and Timms, K. (2015), "Through a Records Management Lens: Creating a Framework for Trust in Open Government and Open Government Information", in Canadian Journal of Information and Library Science 39, No. 2 (2015): 154-190.

Mazon, J.N., Zubcoff, J.J., Garrig, I., Espinosa, R. and Rodríguez, R. (2012), Open business intelligence: on the importance of data quality awareness in user-friendly data mining. Paper presented at the Proceedings of the 2012 Joint EDBT/ICDT Workshops, Berlin, Germany (2012).

Mosley, M. (2008), "DAMA-DMBOK Functional Framework". DAMA International, 2008.

Nationaal Archief (2016e), "DUTO: Belangen in balans: Handreiking voor waardering en selectie van archiefbescheiden in de digitale tijd". Available at: https://www.nationaalarchief.nl/archiveren/kennisbank/handreiking-waardering-en-selectie

Reigeluth, Tyler (2014), 'Why data is not enough: digital traces as control of self and self-control', in Surveillance & Society 12(2), 243-254.

Richards, M. (2015), "Software Architecture Patterns: Understanding Common Architecture Patterns and when to use them". O'Reilly Media, Incorporated, 2015.

Richards, Neil M. and Jonathan H. King (2013), 'Three paradoxes of big data'. Stanford Law Review, September 2013. Available at: https://review.law.stanford.edu/wp-content/uploads/sites/3/2016/08/66_StanLRevOnline_41_RichardsKing.pdf


Serra, L.E.C. (2014), "The mapping, selecting and opening of data: the records management contribution to the open data project in Girona city council", Records Management Journal, Vol. 24 No. 2, pp. 87-98.

Strong, D.M., Lee, Y.W. and Wang, R.Y. (1997), 10 potholes in the road to information quality. IEEE Computer 30(8), 38-46 (1997).

Suderman, J. and Timms, K. (2017), NA08 Open Data, Open Government and Big Data: Implications for the Management of Records in an Online Environment. Researchers: Final Report. Available at: https://interparestrust.org/assets/public/dissemination/IPT_NA08_FinalReport_1Oct2016_fordistribution_.pdf

Sunlight Foundation (2010), Ten Principles for Opening Up Government Information. Available online: https://sunlightfoundation.com/policy/documents/ten-open-data-principles/

Tennis, J.T. (2019), Evidence of Authenticity through "Metadata" and its Sources in Records Preservation. NA16 Metadata: Mutatis mutandis – Design Requirements for Authenticity in the Cloud and Across Contexts. Available at: https://interparestrust.org/assets/public/dissemination/Tennis.pdf

Ubaldi, B. (2013), "Open government data: towards empirical analysis of open government data initiatives", OECD Working Papers on Public Governance, No. 22, OECD Publishing. Available at: http://dx.doi.org/10.1787/5k46bj4f03s7-en

Upward, F., Reed, B., Oliver, G. and Evans, J. (2018), Recordkeeping informatics for a networked age. Monash University Publishing, Clayton, Victoria, 2018.

Wang, X., Govindan, K. and Mohapatra, P. (2011), "Collusion-resilient quality of information evaluation based on information provenance", IEEE 8th Communications Society Conference on Sensor, Mesh and Ad Hoc Communications and Networks, 2011.

Welle Donker, Frederika & Bastiaan van Loenen (2017), How to assess the success of the open data ecosystem?, International Journal of Digital Earth, 10:3, 284-306. Available at: http://dx.doi.org/10.1080/17538947.2016.1224938

Yeo, Geoffrey (2018), Record, Information and Data: Exploring the Role of Record-Keeping in an Information Culture. University College London.

Yeo, Geoffrey (2017), 'Information, records, and the philosophy of speech acts', in Frans Smit, Arnoud Glaudemans and Rienk Jonker (eds) Archives in Liquid Times ('s-Gravenhage 2017), 93-118. Available at: http://www.oapen.org/search?identifier=641001

Yoon, A. (2014), End users' trust in data repositories: definition and influences on trust development, in Archival Science (2014) 14: 17-34.

Zuiderwijk, A., Janssen, M. and Dwivedi, Y.K. (2015), "Acceptance and use predictors of open data technologies: drawing upon the unified theory of acceptance and use of technology", Government Information Quarterly, Vol. 32 No. 4, pp. 429-440.

