

A comparison of research data management platforms

Architecture, flexible metadata and interoperability

Ricardo Carvalho Amorim, João Aguiar Castro, João Rocha da Silva, Cristina Ribeiro


Abstract Research data management is rapidly becoming a regular concern for researchers, and institutions need to provide them with platforms to support data organization and preparation for publication. Some institutions have adopted institutional repositories as the basis for data deposit, whereas others are experimenting with richer environments for data description, in spite of the diversity of existing workflows. This paper is a synthetic overview of current platforms that can be used for data management purposes. Adopting a pragmatic view on data management, the paper focuses on solutions that can be adopted in the long tail of science, where investments in tools and manpower are modest. First, a broad set of data management platforms is presented, some designed for institutional repositories and digital libraries, to select a short list of the more promising ones for data management. These platforms are compared considering their architecture, support for metadata, existing programming interfaces, as well as their search mechanisms and community acceptance. In this process, the stakeholders’ requirements are also taken into account. The results show that there is still plenty of room for improvement, mainly regarding the specificity of data description in different domains, as well as the potential for integration of the data management platforms with existing research management tools. Nevertheless, depending on the context, some platforms can meet all or part of the stakeholders’ requirements.

This paper is an extended version of a previously published comparative study. Please refer to the WCIST 2015 conference proceedings (doi: 10.1007/978-3-319-16486-1)

Ricardo Carvalho Amorim, INESC TEC / Faculdade de Engenharia da Universidade do Porto. E-mail: [email protected]

João Aguiar Castro, INESC TEC / Faculdade de Engenharia da Universidade do Porto. E-mail: [email protected]

João Rocha da Silva, INESC TEC / Faculdade de Engenharia da Universidade do Porto. E-mail: [email protected]

Cristina Ribeiro, INESC TEC / Faculdade de Engenharia da Universidade do Porto. E-mail: [email protected]

1 Introduction

The number of published scholarly papers is steadily increasing, and there is a growing awareness of the importance, diversity and complexity of data generated in research contexts [25]. The management of these assets is currently a concern for both researchers and institutions who have to streamline scholarly communication, while keeping record of research contributions and ensuring the correct licensing of their contents [23,18]. At the same time, academic institutions have new mandates, requiring data management activities to be carried out during the research projects, as a part of research grant contracts [14,26]. These activities are invariably supported by software platforms, increasing the demand for such infrastructures.

This is a post-peer-review, pre-copyedit version of an article published in Universal Access in the Information Society. The final authenticated version is available online at: https://doi.org/10.1007/s10209-016-0475-y

This paper presents an overview of several prominent research data management platforms that can be put in place by an institution to support part of its research data management workflow. It starts by identifying a set of well-known repositories that are currently being used for either publications or data management, discussing their use in several research institutions. Then, focus moves to their fitness to handle research data, namely their domain-specific metadata requirements and preservation guidelines. Implementation costs, architecture, interoperability, content dissemination capabilities, implemented search features and community acceptance are also taken into consideration.

When faced with the many alternatives currently available, it can be difficult for institutions to choose a suitable platform to meet their specific requirements. Several comparative studies between existing solutions were already carried out in order to evaluate different aspects of each implementation, confirming that this is an issue with increasing importance [16,3,6]. This evaluation considers aspects relevant to the authors’ ongoing work, focused on finding solutions to research data management, and takes into consideration their past experience in this field [33]. This experience has provided insights on specific, local needs that can influence the adoption of a platform and therefore the success in its deployment.

It is clear that the effort in creating metadata for research datasets is very different from what is required for research publications. While publications can be accurately described by librarians, good quality metadata for a dataset requires the contribution of the researchers involved in its production. Their knowledge of the domain is required to adequately document the dataset production context so that others can reuse it. Involving the researchers in the deposit stage is a challenge, as the investment in metadata production for data publication and sharing is typically higher than that required for the addition of notes that are only intended for their peers in a research group [7].

Moreover, the authors look at staging platforms, which are especially tailored to capture metadata records as they are produced, offering researchers an integrated environment for their management along with the data. As this is an area with several proposals in active development, two of them are considered here: EUDAT, which includes tools for data staging, and Dendro, a platform proposed for engaging researchers in data description, taking into account the need for data and metadata organisation.

Staging platforms are capable of exporting the enclosed datasets and metadata records to research data repositories. The platforms selected for the analysis in the following sections are considered as research data management repositories for datasets in the long tail of science, as they are designed with sharing and dissemination in mind. Together, staging platforms and research data repositories provide the tools to handle the stages of the research workflow. Long-term preservation imposes further requirements, and other tools may be necessary to satisfy them. However, as datasets become organised and described, their value and their potential for reuse will prompt further preservation actions.

2 From publications to data management

The growth in the number of research publications, combined with a strong drive towards open access policies [8,10], continues to foster the development of open-source platforms for managing bibliographic records. While data citation is not yet a widespread practice, the importance of citable datasets is growing. Until a culture of data citation is widely adopted, however, many research groups are opting to publish so-called “data papers”, which are more easily citable than datasets. Data papers serve not only as a reference to datasets but also document their production context [9].

As data management becomes an increasingly important part of the research workflow [24], solutions designed for managing research data are being actively developed by both open-source communities and data management-related companies. As with institutional repositories, many of their design and development challenges have to do with description and long-term preservation of research data. There are, however, at least two fundamental differences between publications and datasets: the latter are often purely numeric, making it very hard to derive any type of metadata by simply looking at their contents; also, datasets require detailed, domain-specific descriptions to be correctly interpreted. Metadata requirements can also vary greatly from domain to domain, requiring repository data models to be flexible enough to adequately represent these records [35]. The effort invested in adequate dataset description is worthwhile, since it has been shown that research publications that provide access to their base data consistently yield higher citation rates than those that do not [27].

As these repositories deal with a reasonably small set of managed formats for deposit, several reference models, such as the OAIS (Open Archival Information System) [12] are currently in use to ensure preservation and to promote metadata interchange and dissemination. Besides capturing the available metadata during the ingestion process, data repositories often distribute this information to other instances, improving the publications’ visibility through specialised research search engines or repository indexers. While the former focus on querying each repository for exposed contents, the latter help users find data repositories that match their needs, such as repositories from a specific domain or storing data from a specific community. Governmental institutions are also promoting the disclosure of open data to improve citizen commitment and government transparency, and this motivates the use of data management platforms in this context.

2.1 An overview on existing repositories

While depositing and accessing publications from different domains is already possible in most institutions, ensuring the same level of accessibility to data resources is still challenging, and different solutions are being tried to expose and share data in some communities. Addressing this issue, we synthesize a preliminary classification of these solutions according to their specific purpose: they target either staging, in early research activities, or the management of deposited datasets, making them available to the community.

Table 1 identifies features of the selected platforms that may render them convenient for data management. To build the table, the authors resorted to the documentation of the platforms, and to basic experiments with demonstration instances, whenever available. In the first column, under “Registered repositories”, is the number of running instances of each platform, according to the OpenDOAR platform as of mid-October 2015.

In the analysis, five evaluation criteria that can be relevant for an institution to make a coarse-grained assessment of the solutions are considered. Some existing tools were excluded from this first analysis, mainly because some of their characteristics place them outside of the scope of this work. This is the case of platforms specifically targeting research publications (and that cannot be easily modified for managing data), and heavy-weight platforms targeted at long-term preservation. Also excluded were those that, from a technical point of view, do not comply with desirable requirements for this domain such as adopting an open-source approach, or providing access to their features via comprehensive APIs.

When comparing the number of existing installations, it is natural to assume that a large number of instances for a platform is a good indication of the existence of support for its implementation. Repositories such as DSpace are widely used among institutions to manage publications. Therefore, institutions using DSpace to manage publications can use their support for the platform to expand or replicate the repository and meet additional requirements.

It is important to mention that some repositories do not implement interfaces with existing repository indexers, and this may cause the OpenDOAR statistics to show a value lower than the actual number of existing installations. Moreover, services provided by EUDAT, Figshare and Zenodo, for instance, consist of a single installation that receives all the deposited data, rather than a distributed array of manageable installations.

Government-supported platforms such as CKAN are currently being used as part of the open government initiatives in several countries, allowing the disclosure of data related to sensitive issues such as budget execution, and their aim is to vouch for transparency and credibility towards taxpayers [21,20]. Although not specifically tailored to meet research data management requirements, these data-focused repositories also have an increasing number of instances supporting complex research data management workflows [38], even at universities¹.

Access to the source code can also be a valuable criterion for selecting a platform, primarily to avoid vendor lock-in, which is usually associated with commercial software or other provided services. Vendor lock-in is undesirable from a preservation point of view as it places the maintenance of the platform (and consequently the data stored inside) in the hands of a single vendor, which may not be able to provide support indefinitely. The availability of a platform’s source code also allows additional modifications to be carried out in order to create customized workflows; examples include improved metadata capabilities and data browsing functionalities. Commercial solutions such as ContentDM may incur high subscription fees, which can make them cost-prohibitive for non-profit organizations or small research institutions. In some cases only a small portion of the source code for the entire solution is actually available to the public. This is the case with EUDAT, where only the B2Share module is currently open²; the remaining modules are unavailable to date.

From an integration point of view, the existence of an API can allow for further development and help with the repository maintenance, as the software ages. Solutions that do not, at least partially, comply with this requirement, may hinder the integration with external platforms to improve the visibility of existing contents. The lack of an API creates a barrier to the development of tools to support a platform in specific environments, such as laboratories that frequently produce data to be directly deposited and disclosed. Finally, regarding long-term preservation, some platforms fail to provide unique identifiers for the resources upon deposit, making persistent references to data and data citation in publications hard.

1 http://ckan.org/2013/11/28/ckan4rdm-st-andrews/
2 Source code repository for B2Share is hosted via GitHub at https://github.com/EUDAT-B2SHARE/b2share


Table 1: Limitations of the identified repository solutions. Sources for the number of registered repositories: OpenDOAR platform; corresponding web-site for CKAN. † Only available through additional plug-ins. * Only partially.

Columns: Registered repositories | Closed source | No API | No unique identifiers | Complex installation or setup | No OAI-PMH compliance

Platform | Registered repositories | Limitation marks (in row order)
CKAN | 139 | ✗†, ✗*
ContentDM | 53 | ✗
Dataverse | 2 |
Digital Commons | 141 | ✗, ✗
DSpace | 1305 |
ePrints | 407 | ✗†
EUDAT | - | ✗*
Fedora | 41 | ✗
Figshare | - | ✗
Greenstone | 51 | ✗, ✗, ✗
Invenio | 20 |
Omeka | 4 | ✗, ✗†
SciELO | 18 | ✗
WEKO | 40 | No data
Zenodo | - |

Support for flexible research workflows makes some repository solutions attractive to smaller institutions looking for solutions to implement their data management workflows. Both DSpace and ePrints, for instance, are quite common as institutional repositories to manage publications, as they offer broad compatibility with the harvesting protocol OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting) [22] and with preservation guidelines according to the OAIS model. OAIS requires the existence of different packages with specific purposes, namely SIP (Submission Information Package), AIP (Archival Information Package) and DIP (Dissemination Information Package). The OAIS reference model defines SIP as a representation of packaged items to be deposited in the repository. AIP, on the other hand, represents the packaged digital objects within the OAIS-compliant system, and DIP holds one or several digital artifacts and their representation information, in a format that can be interpreted by potential users.
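The roles of the three OAIS packages can be illustrated with a minimal Python sketch. The class and function names (`SIP`, `AIP`, `DIP`, `ingest`, `disseminate`) are hypothetical and are not taken from any of the platforms discussed here. The sketch also assumes, for illustration only, that ingest records a checksum per file and that dissemination verifies it before handing files to a consumer.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class SIP:
    """Submission Information Package: what the depositor hands over."""
    files: dict[str, bytes]
    metadata: dict[str, str]


@dataclass
class AIP:
    """Archival Information Package: what the repository keeps internally."""
    files: dict[str, bytes]
    metadata: dict[str, str]
    checksums: dict[str, str] = field(default_factory=dict)


@dataclass
class DIP:
    """Dissemination Information Package: what a consumer receives."""
    files: dict[str, bytes]
    metadata: dict[str, str]


def ingest(sip: SIP) -> AIP:
    """Turn a SIP into an AIP, adding a SHA-256 checksum per file (assumed policy)."""
    checksums = {name: hashlib.sha256(data).hexdigest()
                 for name, data in sip.files.items()}
    return AIP(files=dict(sip.files), metadata=dict(sip.metadata), checksums=checksums)


def disseminate(aip: AIP, wanted: list[str]) -> DIP:
    """Build a DIP with the requested files, checking each checksum first."""
    selected = {}
    for name in wanted:
        data = aip.files[name]
        if hashlib.sha256(data).hexdigest() != aip.checksums[name]:
            raise ValueError(f"checksum mismatch for {name}")
        selected[name] = data
    return DIP(files=selected, metadata=dict(aip.metadata))


# Invented example data.
sip = SIP(files={"readings.csv": b"t,value\n0,1.5\n"},
          metadata={"title": "Sensor readings"})
aip = ingest(sip)
dip = disseminate(aip, ["readings.csv"])
```

Keeping the three packages as separate types makes the boundary between deposit, storage and access explicit, which is the point of the OAIS distinction.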

2.2 Stakeholders in research data management

Several stakeholders are involved in dataset description throughout the data management workflow, playing an important part in their management and dissemination [24,7]. These stakeholders (researchers, research institutions, curators, harvesters, and developers) play a governing role in defining the main requirements of a data repository for the management of research outputs. As key metadata providers, researchers are responsible for the description of research data. They are not necessarily knowledgeable in data management practices, but can provide domain-specific, more or less formal descriptions to complement generic metadata. This captures the essential data production context, making it possible for other researchers to reuse the data [7]. As data creators, researchers can play a central role in data deposit by selecting appropriate file formats for their datasets, preparing their structure and packaging them appropriately [15]. Institutions are also motivated to have their data recognized and preserved according to the requirements of funding institutions [17,26]. In this regard, institutions value metadata in compliance with standards, which make data ready for inclusion in networked environments, therefore increasing their visibility. To make sure that this context is correctly passed, along with the data, to the preservation stage, curators are mainly interested in maintaining data quality and integrity over time. Usually, curators are information experts, so it is expected that their close collaboration with researchers can result in both detailed and compliant metadata records.

Considering data dissemination and reuse, harvesters can be either individuals looking for specific data or services which index the content of several repositories. These services can make particularly good use of established protocols, such as the OAI-PMH, to retrieve metadata from different sources and create an interface to expose the indexed resources. Finally, contributing to the improvement and expansion of these repositories over time, developers are concerned with the underlying technologies, and also with having extensive APIs to promote integration with other tools.
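As an illustration of how a harvesting service might address a repository through OAI-PMH, the following Python sketch builds `ListRecords` request URLs. The endpoint URL and the set name are invented for the example; `verb`, `metadataPrefix`, `set` and `resumptionToken` are request arguments defined by the protocol. No request is actually sent.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None,
                     resumption_token=None):
    """Build an OAI-PMH ListRecords request URL."""
    if resumption_token:
        # In OAI-PMH, resumptionToken is an exclusive argument: it replaces
        # metadataPrefix and set when asking for the next page of results.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if set_spec:
            params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


# Hypothetical endpoint, used only for illustration.
BASE = "https://repository.example.org/oai/request"
first_page = list_records_url(BASE, set_spec="engineering")
next_page = list_records_url(BASE, resumption_token="page-2")
```

A harvester would typically repeat the second form of the request until the repository stops returning a resumption token.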

3 Scope of the analysis

The stakeholders in the data management workflow can greatly influence whether research data is reused. The selection of platforms in the analysis acknowledges their role, as well as the importance of the adoption of community standards to help with data description and management in the long run.

For this comparison, data management platforms with instances running at both research and government institutions have been considered, namely DSpace, CKAN, Zenodo, Figshare, ePrints, Fedora and EUDAT. If the long-term preservation of research assets is an important requirement of the stakeholders in question, other alternatives such as RODA [30] and Archivematica may also be considered strong candidates, since they implement comprehensive preservation guidelines not only for the digital objects themselves but also for their whole life cycle and associated processes. On one hand, these platforms have a strong concern with long-term preservation by strictly following existing standards such as OAIS, PREMIS or METS, which cover the different stages of a long-term preservation workflow. On the other hand, such solutions are usually harder to install and maintain by institutions in the so-called long tail of science: institutions that create large numbers of small datasets, but do not possess the necessary financial resources and preservation expertise to support a complete preservation workflow [18].

The Fedora framework³ is used by some institutions, and is also under active development, with the recent release of Fedora 4. The fact that it is designed as a framework to be fully customized and instantiated, instead of being a “turnkey” solution, places Fedora on a different level, which cannot be directly compared with other solutions. Two examples of Fedora implementations are Hydra⁴ and Islandora⁵. Both are open-source, capable of handling research workflows, and use the best-practices approach already implemented in the core Fedora framework. Although these are not present in the comparison table, this section will also consider their strengths, when compared to the other platforms.

3 http://www.fedora-commons.org/
4 http://projecthydra.org/
5 http://islandora.ca/

An overview of the previously identified stakeholders led to the selection of two important dimensions for the assessment of the platform features: their architecture and their metadata and dissemination capabilities. The former includes aspects such as how they are deployed into a production environment, the locations where they keep their data, whether their source code is available, and other aspects that are related to the compliance with preservation best practices. The latter focuses on how resource-related metadata is handled and the level of compliance of these records with established standards and exchange protocols. Other important aspects are their adoption within the research communities and the availability of support for extensions. Table 2 shows an overview of the results of our evaluation.

4 Platform comparison

Based on the selection of the evaluation scope, this section addresses the comparison of the platforms according to key features that can help in the selection of a platform for data management. Table 2 groups these features in two categories: (i) Architecture, for structural-related characteristics; and (ii) Metadata and dissemination, for those related to flexible description and interoperability. This analysis is guided by the use cases in the research data management environment.
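To show how such a feature table can support a selection, the following Python sketch encodes four of the yes/no rows of Table 2 and shortlists the platforms that satisfy a set of requirements. The dictionary keys and the `shortlist` function are hypothetical names chosen for this example; the values are transcribed from Table 2 and the sketch is not part of the evaluation method itself.

```python
# Feature values transcribed from Table 2 (True = supported).
PLATFORMS = {
    "DSpace":   {"open_source": True,  "oai_pmh": True,
                 "flexible_schema": True,  "content_versioning": False},
    "CKAN":     {"open_source": True,  "oai_pmh": False,
                 "flexible_schema": True,  "content_versioning": True},
    "Figshare": {"open_source": False, "oai_pmh": True,
                 "flexible_schema": False, "content_versioning": False},
    "Zenodo":   {"open_source": False, "oai_pmh": True,
                 "flexible_schema": False, "content_versioning": False},
    "ePrints":  {"open_source": True,  "oai_pmh": True,
                 "flexible_schema": False, "content_versioning": True},
    "EUDAT":    {"open_source": False, "oai_pmh": True,
                 "flexible_schema": True,  "content_versioning": True},
}


def shortlist(required):
    """Return the platforms that support every feature named in `required`."""
    return [name for name, features in PLATFORMS.items()
            if all(features[feature] for feature in required)]


# Example requirement: an open-source platform with native OAI-PMH support.
candidates = shortlist(["open_source", "oai_pmh"])
```

In practice the requirements would be weighted against local needs rather than treated as strict filters; the strict filter is used here only to keep the sketch short.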

4.1 Architecture

Regarding the architecture of the platforms, several aspects are considered. From the point of view of a research institution, a quick and simple deployment of the selected platform is an important aspect. There are two main scenarios: the institution can either outsource to an external service or install and customize its own repository, supporting the infrastructure maintenance costs. Contracting a service provided by a dedicated company such as Figshare or Zenodo delegates platform maintenance for a fee. The service-based approach may not be viable in some scenarios, as some researchers or institutions may be reluctant to deposit their data in a platform outside their control [11]. DSpace, ePrints, CKAN or any Fedora-based solution can be installed and run completely under the control of the research institution and therefore offer better control over the stored data. As open-source solutions, they also have several supporters⁶ that contribute to their expansion with additional plugins or extensions to meet specific requirements.

DSpace, CKAN and Zenodo allow a certain degree of customization to satisfy the needs of their users: while Zenodo allows parametrization settings such as community-level policies, CKAN, DSpace and Fedora, as open-source solutions, can be further customized, with improvements ranging from small interface changes to the development of new data visualization plugins [33,34]. Due to its complex architecture, DSpace may require a higher level of expertise when dealing with custom features. However, its larger supporting community may help in tackling such barriers. The same applies to Fedora as it requires the research institution to choose among different technologies to design and implement the end-user interface, which can exclude it as an option if limited time or budget restrictions apply. A positive aspect in all packaged platforms is that they provide easy internationalization support. The Zenodo and Figshare services are available in English only, as well as the majority of EUDAT’s interfaces; an exception is its B2Share module, which is built on the Invenio platform, which already has internationalization features.

6 http://ckan.org/instances/ and http://registry.duraspace.org/registry/dspace

Table 2: Comparison of the selected research data management platforms

Feature | DSpace | CKAN | Figshare | Zenodo | ePrints | EUDAT

Architecture
Deployment | Installation package or service | Installation package | Service | Service | Installation package or service | Service
Storage location | Local or remote | Local or remote | Remote | Remote | Local or remote | Remote
Maintenance costs | Infrastructure management | Infrastructure management | Monthly fee | Monthly fee | Infrastructure management | Monthly fee
Open source | ✓ | ✓ | ✗ | ✗ | ✓ | ✗
Customization | ✓ | ✓ | ✗ | Community policies | ✓ | ✗
Internationalization support | ✓ | ✓ | ✗ | ✗ | ✓ | ✗
Embargo | ✓ | Private storage | Private storage | ✓ | ✓ | ✓
Content versioning | ✗ | ✓ | ✗ | ✗ | ✓ | ✓
Pre-reserving DOI | ✓ | ✗ | ✓ | ✓ | ✓ | ✓

Metadata & Dissemination
Exporting schemas | Any pre-loaded schemas | None | DC | DC, MARCXML | DC, METS, MODS, DIDL | DC, MARC, MARCXML
Schema flexibility | Flexible | Flexible | Fixed | Fixed | Fixed | Flexible
Validation | ✓ | ✗ | ✗ | ✓ | ✓ | ✓
Versioning | ✗ | ✓ | ✗ | ✗ | ✓ | ✓
OAI-PMH | ✓ | ✗ | ✓ | ✓ | ✓ | ✓
Record license specification | ✓ | ✓ | ✓ | ✓ | ✓ | ✓

A collaborative environment for teams and groups to manage the deposited resources is becoming increasingly important in the research workflows of many institutions. In this regard, both CKAN and Zenodo provide collaborative tools and allow users to fully manage their group members and policies. ePrints and DSpace are not designed to support real-time collaborative environments where researchers can produce data and describe them incrementally, so these platforms can be less suited to support dynamic data production environments. By adopting a dynamic approach to data management, tasks can be made easier for the researchers, who may then be motivated to use the data management platform as part of their daily research activities, while they are working on the data. Otherwise, researchers may only consider depositing data in the platform after datasets are finished (no longer in active gathering or processing) and this is likely to reduce the number of datasets that get into the deposit phase. Moreover, different researchers may have a different approach to dataset structure and description, and this will cause difficulties to the workflows that rely solely on deposit.

EUDAT provides a collaborative environment by integrating file management and sharing into the research workflow via a desktop application. This application can automatically synchronize files to one of the environment’s modules (B2Drop). After the files are uploaded, they can be used for computation in B2Stage or shared in B2Share to major portals in several research areas. They can also be available for search in B2Find, the repository of the EUDAT environment designed to be an aggregator for metadata on research datasets. EUDAT’s B2Share service is built on the Invenio data management platform. This platform is flexible, available under an open-source license, and compatible with several metadata representations, while still providing a complete API. However, it could be hard to manage and possibly decommission an Invenio platform in the future, since its underlying relational model is complex and very tightly connected to the platform’s code [35].

The control over data release dates can also be a concern for researchers. DSpace, ePrints, Zenodo and EUDAT allow users to specify embargo periods; data are made available to the community after these periods expire. CKAN and Figshare have options for private storage, to let researchers control the data publication mode.
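The embargo rule described above amounts to a date comparison, as in the following minimal Python sketch. The function name `is_public` is hypothetical, and the choice to keep a record closed on the last day of the embargo is an assumption; each platform defines its own policy.

```python
from datetime import date


def is_public(embargo_until, today):
    """A record is public when it has no embargo or the embargo has expired.

    Assumption: the record stays closed on the embargo end date itself.
    """
    return embargo_until is None or today > embargo_until


release = date(2016, 1, 1)  # invented embargo end date
visible_before = is_public(release, date(2015, 12, 31))
visible_after = is_public(release, date(2016, 1, 2))
```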

4.2 Metadata: a key for preservation

Research data can benefit from domain-level metadata to contextualize their production [37]. While the evaluated platforms have different description requirements upon deposit, most of them lack support for domain-specific metadata schemas. In this regard DSpace is an exception, with its ability to use multiple schemas that can be set up by a system administrator. The same happens with Islandora, which uses the support for descriptive metadata available in Fedora, allowing the creation of tailored metadata forms, if the corresponding plugin is installed. This is a solution for the requirement of providing research data with domain-level metadata, a matter that is still to be addressed by several other platforms. Both Zenodo and Figshare can export records that comply with established metadata schemas (Dublin Core and MARC-XML, and Dublin Core, respectively). DSpace goes further by exporting DIPs that include METS metadata records, thus enabling the ingestion of these packages into a long-term preservation workflow. Although CKAN metadata records do not follow any standard schema, the platform allows the inclusion of a dictionary of key-value pairs that can be used, for instance, to record domain-specific metadata as a complement to generic metadata descriptions.

None of these platforms natively supports collaborative validation stages where curators and researchers enforce the correct data and metadata structure, although Zenodo allows the users to create a highly curated area within communities, as highlighted in the “validation” feature in Table 2. If the policy of a particular community specifies manual validation, every deposit will have to be validated by the community curator. EUDAT does not support domain-dependent metadata; however, it can gather different sets of descriptors when depositing to different projects using B2Share. For example, when the user performing the deposit chooses GBIF (a biodiversity infrastructure) as the target project for the new dataset, some predefined, biodiversity-related descriptors become available to be filled in as a complement to the generic ones. These domain-specific descriptors can greatly improve generic descriptions. Datasets originate from very specific research domains, thus requiring specific descriptions to be correctly interpreted by potential users.
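The key-value mechanism mentioned for CKAN can be sketched as follows. The dataset name, title and descriptor names are invented; the layout, with generic fields plus an `extras` list of `key`/`value` pairs, follows the structure of CKAN dataset dictionaries as we understand it, and the sketch only builds the JSON payload without contacting any server.

```python
import json


def ckan_dataset(name, title, domain_metadata):
    """Build a dataset dictionary with generic fields plus key-value extras."""
    return {
        "name": name,
        "title": title,
        # Domain-specific descriptors go into free-form key-value pairs;
        # values are stored as strings.
        "extras": [{"key": key, "value": str(value)}
                   for key, value in domain_metadata.items()],
    }


# Invented example: two domain descriptors for a soil science dataset.
dataset = ckan_dataset(
    "soil-samples-2015",
    "Soil samples",
    {"sampling_depth_cm": 30, "instrument": "hand auger"},
)
payload = json.dumps(dataset, indent=2)
```

Because the keys are unconstrained, nothing in this structure checks that two depositors use the same descriptor names, which is the trade-off of unrestricted metadata noted in the text.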

Tracking content changes is also an important issue in data management, as datasets are often versioned and dynamic. CKAN provides an auditing trail of each deposited dataset by showing all changes made to it since its deposit. EUDAT deals with the problem of metadata auditing in the same way, because its dataset search and retrieval engine, B2Find, is based on CKAN technology⁷, and can therefore provide the same auditing trail interface.
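The idea of an auditing trail can be illustrated with a small Python sketch in which every metadata change is appended to a history list. The `AuditedRecord` class is a hypothetical structure written for this example and does not reproduce the way CKAN or B2Find store revisions.

```python
from dataclasses import dataclass, field


@dataclass
class AuditedRecord:
    """A metadata record that logs every field change (hypothetical structure)."""
    metadata: dict = field(default_factory=dict)
    history: list = field(default_factory=list)

    def update(self, author, **changes):
        """Apply changes and append one history entry per modified field."""
        for name, new_value in changes.items():
            old_value = self.metadata.get(name)
            if old_value == new_value:
                continue  # unchanged fields leave no trace
            self.history.append({
                "revision": len(self.history) + 1,
                "author": author,
                "field": name,
                "old": old_value,
                "new": new_value,
            })
            self.metadata[name] = new_value


# Invented example: two users edit the same record.
record = AuditedRecord()
record.update("alice", title="Soil samples", license="CC-BY")
record.update("bob", title="Soil samples 2015", license="CC-BY")
```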

4.3 Interoperability and dissemination

Exposing repository contents to other research platforms can improve both data visibility and reuse [24]. All of the evaluated platforms allow the development of external clients and tools as they already provide their own APIs for exposing metadata records to the outside community, with some differences regarding standards compliance. In this matter, only CKAN is not natively compliant with OAI-PMH. This is a widely used protocol that promotes interoperability between repositories while also streamlining data dissemination, and is a valuable resource for harvesters to index the contents of the repository [22,13]. As CKAN was originally designed for government data, it is understandable that it lacks this compliance, although this can leave institutions reluctant to adopt it, as they can also have interest in getting their datasets cited by the community.
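To make the harvesting side concrete, the following Python sketch extracts identifiers and titles from an OAI-PMH `ListRecords` response carrying Dublin Core (`oai_dc`) metadata. The sample response is hand-written for this example and the record it contains is invented; the three namespace URIs are those used by OAI-PMH and Dublin Core. The function name `harvest_titles` is hypothetical.

```python
import xml.etree.ElementTree as ET

# XML namespaces used by OAI-PMH responses with Dublin Core metadata.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hand-written, abbreviated sample response (invented record).
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:repository.example.org:123</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Soil samples</dc:title>
          <dc:creator>Example, A.</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def harvest_titles(xml_text):
    """Return (identifier, title) pairs for every record in the response."""
    root = ET.fromstring(xml_text)
    results = []
    for record in root.iterfind(".//oai:record", NS):
        identifier = record.findtext("oai:header/oai:identifier", namespaces=NS)
        title = record.findtext(".//dc:title", namespaces=NS)
        results.append((identifier, title))
    return results


records = harvest_titles(SAMPLE)
```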

7 Please refer to http://eudat.eu/sites/default/files/DaanBroeder.pdf


Table 3: Key advantages of the evaluated repository platforms

Platform: Key advantages

Figshare
– Gives credit to authors through citations and references
– Can export references to Mendeley, DataCite, RefWorks, Endnote, NLM and ReferenceManager
– Records statistics related to citations and shares
– Does not require any maintenance

Zenodo
– Allows creating communities to validate submissions
– Supports Dublin Core, MARC and MARCXML for metadata exporting
– Can export references to BibTeX, DataCite, DC, EndNote, NLM, RefWorks
– Complies with OAI-PMH for data dissemination
– Does not require any maintenance
– Includes metadata records in the searchable fields

CKAN
– Is open-source and widely supported by the developer community
– Features extensive and comprehensive documentation
– Allows deep customization of its features
– Can be fully under the institution’s control
– Supports unrestricted (non standards-compliant) metadata
– Has faceted search with fuzzy-matching
– Records dataset change logs and versioning information

DSpace
– Can comply with domain-level metadata schemas
– Is open-source and has a wide supporting community
– Has extensive, community-maintained documentation
– Can be fully under the institution’s control
– Structured metadata representation
– Compliant with OAI-PMH

ePrints
– Can maintain records of changes in preservation metadata records
– Compliant with OAI-PMH
– Compliant with SWORD for multiple deposit

EUDAT
– Modular approach that provides a variety of services to match local needs
– Strong support from European agencies
– Integration of several open-source platforms (CKAN, Invenio)
– End-to-end workflow for research data management
– Majority of features are available for free to European researchers

It is interesting not only to evaluate platforms according to the ease of discovery by machines, but also to see how easily humans can find a dataset there. All three platforms possess free-text search capabilities, indexing the metadata in dataset records for retrieval purposes. All analyzed platforms provide an “advanced” search feature that is in practice a faceted search. Depending on the platform, users can restrict the results to smaller sets, for instance from a domain such as Engineering. This search feature makes it easier for researchers to find datasets that are from relevant domains and belong to specific collections or similar dataset categories (the concept varies between platforms as they have different organizational structures). ePrints, for instance, allows search on the metadata records and includes boolean operators to refine the results, as well as full-text search for some of the compatible data formats, provided the appropriate plugins are installed. When considering the underlying technologies, DSpace stands out as it natively uses Apache Lucene as its search engine, which competes with the Xapian⁸ engine used in ePrints to sort results by relevance.
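The combination of free-text and faceted search described above can be sketched in a few lines. This is an in-memory illustration under simplifying assumptions (substring matching, exact facet values, a hypothetical `DATASETS` list); it does not reproduce the Lucene or Xapian engines used by the platforms, nor their relevance ranking.

```python
from collections import Counter

# Hypothetical metadata records standing in for an indexed repository.
DATASETS = [
    {"title": "Bridge vibration logs", "domain": "Engineering", "format": "CSV"},
    {"title": "Concrete fatigue tests", "domain": "Engineering", "format": "XLSX"},
    {"title": "Coastal bird survey", "domain": "Biodiversity", "format": "CSV"},
]


def search(datasets, text="", **facets):
    """Keep records that contain the text in any field and match every facet value."""
    text = text.lower()
    hits = []
    for ds in datasets:
        matches_text = any(text in str(value).lower() for value in ds.values())
        matches_facets = all(ds.get(key) == value for key, value in facets.items())
        if matches_text and matches_facets:
            hits.append(ds)
    return hits


def facet_counts(datasets, field):
    """Count how many records fall under each value of a facet field."""
    return Counter(ds[field] for ds in datasets if field in ds)
```

Calling `search(DATASETS, domain="Engineering")` restricts the results to one domain, while `facet_counts` produces the per-value totals that a faceted interface typically shows next to each filter.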

8 http://xapian.org/

4.4 Platform adoption

Like most recent platforms, all the repositories depend on a community of developers to maintain and improve their features. Looking for successful case studies, it is important to assess their impact and comprehensiveness. CKAN has several success cases with government data made available to the community, although it lacks other scenarios related to the management and disclosure of research data. Figshare, Zenodo and DSpace have research data as their focus. In active use since 2002, DSpace is well known among institutions and researchers for its capabilities to deal with research publications and, more recently, also to handle research data. DSpace benefits from a dominant position in institutional repositories, and the existence of such instances can favour its adoption for dataset management. Zenodo is a solution for the long tail of science supported by CERN laboratories, and is regarded as an environment to bring research outputs to an appropriate digital archive for preservation. It is therefore also a strong use case, with researchers from many fields already using it.



5 Data staging platforms

Most of the analyzed solutions target data repositories, i.e. the end of the research workflow. They are designed to hold and manage research data outputs after the data production is concluded and the results of their analysis are published. As a consequence, there is an overall lack of support for capturing data during the earlier stages of research activities.

Introducing data management, and metadata production in particular, at an early stage in the research workflow increases the chances of a dataset reaching the final stage of this workflow, when it is kept in a long-term preservation environment. The introduction of data deposit and description earlier in the research workflow means that descriptions will already be partially done by the end of data gathering. Also, more detailed and overall better metadata records can be created in this way, since the data creation context is still present. Researchers can also reap immediate benefits from their data description, as described datasets can more easily be shared among the members of their research group or with external partners.

Data gathering is often a collaborative process, so it makes sense to make metadata production collaborative as well. These requirements have been identified by several research and data management institutions, which have implemented integrated solutions for researchers to manage data not only when it is created, but also throughout the entire research workflow.

Researchers are not data management experts, so they need effective tools that allow them to produce adequate standards-compliant metadata records without having to learn about those standards. Thus, an important characteristic of an effective solution for collaborative data management is its ease of use by non-experts. If these solutions are easy to use and provide both immediate and long-term added value for researchers, they are more likely to be adopted as part of the daily research work. Gradually, this would counteract the idea that data management is a time-consuming process performed only due to policies enforced by funding institutions, or motivated by uncertain and long-term rewards such as the possibility of others citing the datasets.

5.1 Data management as a routine task

There have been important advancements towards the incorporation of data management practices in the day-to-day activities of researchers.

In the UK, the DataFlow project [19] was built to provide researchers with an integrated data management workflow to allow them to store and describe their data safely and easily. The project implemented two components: DataStage and DataBank. DataStage allows researchers standards-based (CIFS, SFTP, SSH, WebDAV) access to shared data storage areas protected by automated backups, as well as a web interface that researchers can use to add metadata to the files that they deposit. The shared storage is accessible from the computers used for their work through a mapped drive. When researchers are ready to deposit a dataset, they can package it as a ZIP file and send it to DataBank via a SWORD endpoint. DataBank is a repository platform that, besides supporting ingestion via the SWORD v2 protocol, supports DOI registration via DataCite, version control, specification of embargo periods and OAI-PMH compliance to foster the dissemination of data. File format-related operations (such as correct identification of the format for a file) are handled by existing tools such as JHOVE and DROID [5].
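The packaging step that precedes a SWORD deposit can be sketched as follows. This is not DataStage code: the helper names are hypothetical, the header set follows the SWORD v2 profile as commonly documented (`Packaging`, `In-Progress`, `Content-MD5`, `Content-Disposition`), and the exact values a given server expects should be checked against its documentation. The sketch only prepares the request and does not send it.

```python
import hashlib
import io
import zipfile


def package_dataset(files):
    """Bundle a mapping of archive path -> bytes into an in-memory ZIP package."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        # Sorting keeps the entry order stable between runs.
        for path, content in sorted(files.items()):
            archive.writestr(path, content)
    return buffer.getvalue()


def deposit_headers(package, filename):
    """Assemble HTTP headers for a binary deposit of the package.

    Assumption: the checksum is sent as a hex digest; confirm the encoding
    expected by the target repository before relying on it.
    """
    return {
        "Content-Type": "application/zip",
        "Content-Disposition": f"attachment; filename={filename}",
        "Content-MD5": hashlib.md5(package).hexdigest(),
        "Packaging": "http://purl.org/net/sword/package/SimpleZip",
        "In-Progress": "false",
    }
```

The resulting bytes and headers would then be sent in an HTTP POST to the collection URL advertised by the repository's SWORD service document.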

datorium is a platform for the description and sharing of research data from the social sciences. Recognizing the increasing demand for base data as supplementary material to research publications, its goal is to provide an easy-to-use platform for researchers to perform autonomous description of their datasets. As in other platforms, metadata is limited to Dublin Core, complemented in this case with some elements taken from the GESIS Data Catalogue DBK [1].

MaDAM [28] is a web-based data management system targeted at the management of research data in research groups. It provides a user-friendly file explorer, as well as an editor for adding metadata to the entities in the folder structure. The descriptors that can be added to a metadata record are fixed and general-purpose, such as “Name”, “Creator” or “Comments”. The platform also has an “Archive” function that allows users to send a dataset to eScholar, the University of Manchester’s preservation and dissemination repository⁹.

DASH¹⁰, a data management platform in use at the University of California, incorporates two previous tools: DataUP¹¹ and DataShare¹². It does not currently support interoperability protocols for deposit or dissemination of datasets such as OAI-PMH or SWORD, which leaves it outside of the present comparison. However, it is an open-source project, and its modules are currently available¹³. It also provides an easy-to-use interface, indexing by scholarly engines, data identification via DOI and integration with Merritt, an in-house developed long-term repository¹⁴.

9 http://www.escholar.manchester.ac.uk/
10 https://dash.library.ucsc.edu/
11 http://dataup.cdlib.org
12 http://datashare.ucsf.edu/xtf/search
13 http://cdluc3.github.io/dash/

HAL is a platform for the deposit, description and dissemination of research datasets. It provides a Wikipedia plugin to modify the layout of Wikipedia pages and directly include links to datasets. This can help researchers find data in Wikipedia pages. The metadata that can be added to each dataset is limited to a set of generic, fixed descriptors, whose values can be derived from the content of relevant Wikipedia pages [29].

As a pan-European effort for the creation of an integrated research data management environment, EUDAT also includes a file sharing module called B2Drop. It provides researchers with 20 GB of free storage, and is integrated with other modules for dataset sharing and staging, including some computational processing on the stored data.

Several interesting concepts have been recently presented as part of an integrated vision for the management of research data within research groups. Some core concepts currently found in social networks can be applied to research data management, making it a natural part of the daily activities of researchers [4]. They include a timeline of changes over resources under the group’s control, comments that are linked to those changes, external sharing controlled by the elements of the research group and the ability to track the interactions of external entities with the dataset (such as citations and “likes”). In this view, researchers are able to browse datasets deposited by group members as they are produced, and also run workflows over that data. The continuous recording of both data and the translation steps that allow a dataset to be derived from others is a very interesting concept not only from a preservation point of view, but also in scientific terms, as it safeguards the reproducibility of research findings.

5.2 Dendro

UPBox and DataNotes were designed and implemented at the University of Porto as coupled solutions to provide users with an integrated data management environment [31]. UPBox was designed to provide researchers with a shared data storage environment, fully under their research institution's control and complemented by an easy-to-use REST API to allow its integration with multiple services. DataNotes was a modified version of Semantic MediaWiki, designed to work with UPBox, allowing researchers to produce wiki-formatted pages describing the files and folders that they had previously sent to the data storage environment. The generated metadata would use descriptors from diverse ontologies from multiple domains and could be exported as RDF records.

14 http://guides.library.ucsc.edu/datamanagement/publish

The lessons learned during the implementation of these two solutions and through the ongoing analysis of requirements in research groups led to the development of Dendro. Dendro is a single solution targeted at improving the overall availability and quality of research data. It aims at engaging researchers in the management and description of their data, focusing on metadata recording at the early stages of the research workflow [32,36]. Dendro is a fully open-source environment (solution and dependencies) that combines an easy-to-use file manager (similar to Dropbox) with the collaborative capabilities of a semantic wiki for the production of semantic metadata records. The solution aims at the description of datasets from different research domains through an extensible, triple store-based data model [35]. Curators can expand the platform’s data model by loading ontologies that specify domain-specific or generic metadata descriptors that can then be used by researchers in their projects. These ontologies can be designed using tools such as Protégé¹⁵, allowing curators with no programming background to extend the platform’s data model. Dendro is designed primarily as a staging environment for dataset description. Ideally, as research publications are written, associated datasets (already described at this point) are packaged and sent to a research data repository, where they go through the deposit workflows. In the end, the process can be made fast enough to enable researchers to cite the datasets in the publication itself as supporting data.
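The idea of an extensible, triple-based metadata model can be illustrated with a small in-memory sketch. It is not Dendro's implementation: the `MetadataStore` class, its methods and the hydrology namespace are hypothetical, and a real deployment would rely on a triple store and ontology files instead of Python sets. Only the Dublin Core Terms namespace is a real vocabulary.

```python
DCTERMS = "http://purl.org/dc/terms/"


class MetadataStore:
    """Toy model: descriptors come from loaded ontologies, metadata is a set of triples."""

    def __init__(self):
        self.descriptors = set()  # descriptor URIs made available by curators
        self.triples = set()      # (resource, descriptor, value) statements

    def load_ontology(self, namespace, properties):
        """Register the properties of an ontology as usable descriptors."""
        for name in properties:
            self.descriptors.add(namespace + name)

    def describe(self, resource, descriptor, value):
        """Add one statement, rejecting descriptors that no loaded ontology defines."""
        if descriptor not in self.descriptors:
            raise KeyError(f"unknown descriptor: {descriptor}")
        self.triples.add((resource, descriptor, value))

    def record(self, resource):
        """Collect all values recorded for a resource, grouped by descriptor."""
        result = {}
        for subject, descriptor, value in sorted(self.triples):
            if subject == resource:
                result.setdefault(descriptor, []).append(value)
        return result
```

Extending the model for a new research domain then amounts to one more `load_ontology` call, with no change to the storage structure, which is the property that makes a triple-based model attractive for multi-domain description.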

Dendro focuses on interoperability to make the deposit process as easy as possible for researchers. It can be integrated with all the repository platforms surveyed in this paper, while its extensive API makes it easy to integrate with external systems. LabTablet, an electronic laboratory notebook designed to help researchers gather metadata in experimental contexts, is an example of a successful integration scenario. It allows researchers to generate metadata records using the mobile device’s on-board sensors, which are then represented using established metadata schemas (e.g. Dublin Core) and uploaded to a Dendro instance for collaborative editing [2].

15 Available at http://protege.stanford.edu/



6 Conclusion

The evaluation showed that it can be hard to select a platform without first performing a careful study of the requirements of all stakeholders. The main positive aspects of the platforms considered here are summarized in Table 3. Of note are the open-source licenses of both CKAN and DSpace, which allow them to be updated and customized while keeping the core functionalities intact.

Although CKAN is mainly used by governmental institutions to disclose their data, its features and extensive API also make it possible to use this repository to manage research data, using its key-value dictionary to store any domain-level descriptors. This feature does not, however, strictly enforce a metadata schema. Curators may therefore favor DSpace, since it enables system administrators to configure additional metadata schemas that can be used to describe resources. These can in turn capture richer domain-specific features that may prove valuable for data reuse.
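The trade-off between free-form key-value descriptors and an enforced schema can be sketched as follows. The `extras` list of key/value pairs mirrors the structure CKAN uses for additional dataset fields; the helper functions, the dataset name and the descriptor names are hypothetical, and the schema check is an illustration of what a curator-defined schema adds, not a feature of either platform's API.

```python
def ckan_style_dataset(name, title, **descriptors):
    """Build a dataset dictionary whose domain descriptors are free-form key-value extras."""
    extras = [{"key": key, "value": str(value)} for key, value in sorted(descriptors.items())]
    return {"name": name, "title": title, "extras": extras}


def missing_descriptors(dataset, required):
    """List the required descriptors that the dataset does not provide.

    Nothing in the key-value structure itself enforces this check; it has to
    be added on top, which is the gap a configured metadata schema closes.
    """
    present = {extra["key"] for extra in dataset["extras"]}
    return sorted(set(required) - present)
```

Any descriptor can be stored this way, but a record is only as complete as the depositor makes it, so a check such as `missing_descriptors` would have to run at deposit time to approximate schema enforcement.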

Researchers need to comply with funding agency requirements, so they may favour easy deposit combined with easy data citation. Zenodo and Figshare provide ways to assign a permanent link and a DOI, even if the actual dataset is under embargo at the time of first citation. In that case, a potential reuser has to contact the data creator directly before access can be provided. Both these platforms are aimed at the direct involvement of researchers in the publication of their data, as they streamline the upload and description processes, though they do not provide support for domain-specific metadata descriptors.

Another very important factor to consider is control over where the data is stored. Some institutions may want the servers where data is stored under their control, and to directly manage their research assets. Platforms such as DSpace and CKAN, which can be installed on an institutional server instead of relying on external storage provided by contracted services, are appropriate for this.

The evaluation of research data repositories can take into account other features besides those considered in this analysis, namely their acceptance within specific research communities and their usability. The paper has focused on repositories as final locations for research data to be deposited and not as a replacement for the tools that researchers already use to manage their data, such as file sharing environments or more complex e-science platforms. The authors consider that these solutions should be compared to other collaborative solutions such as Dendro, a research data management solution currently under development. In this regard, it can be argued that flexible, customizable solutions such as Dendro can meet the needs of research institutions in terms of staging, temporary platforms to help with research data management and description. This should, of course, be done while taking into consideration available metadata standards that can contribute to overall better conditions for long-term preservation [36].

EUDAT features the integration of established open-source solutions (such as CKAN and Invenio) to support a comprehensive data management workflow. The platform is backed by several prominent institutions and promises to deliver a European data management environment to support research. Areas for improvement in this project include metadata production and collaboration. For example, limited domain-specific descriptors are available depending on the portal to which the dataset is being sent, instead of a fully flexible and expandable metadata model that depends on the research domain, such as the one in Dendro [35,36]. Collaboration challenges include the implementation of social-network based concepts for real-time collaboration [4].

For small institutions that struggle to contract a dedicated service for data management purposes, having a wide community supporting the development of a stand-alone platform can be a valuable asset. In this regard, CKAN may have an advantage over the remaining alternatives, as several governmental institutions are already converging on this platform for data publishing.

Acknowledgements

This work is supported by the project NORTE-07-0124-FEDER000059, financed by the North Portugal Regional Operational Programme (ON.2 – O Novo Norte), under the National Strategic Reference Framework (NSRF), through the European Regional Development Fund (ERDF), and by national funds, through the Portuguese funding agency, Fundação para a Ciência e a Tecnologia (FCT). João Rocha da Silva is also supported by research grant SFRH/BD/77092/2011, provided by the Portuguese funding agency, Fundação para a Ciência e a Tecnologia (FCT).

References

1. Alam, A.W., Müller, S., Schumann, N.: datorium: Sharing platform for social science data. In: Proceedings of the 14th International Symposium on Information Science (ISI 2015), pp. 244–249 (2015)

2. Amorim, R.C., Castro, J.A., Rocha da Silva, J., Ribeiro, C.: Labtablet: Semantic metadata collection on a multi-domain laboratory notebook. Springer Communications in Computer and Information Science 478, 193–205 (2014)

3. Armbruster, C., Romary, L.: Comparing repository types: challenges and barriers for subject-based repositories, research repositories, national repository systems and institutional repositories in serving scholarly communication. International Journal of Digital Library Systems 1(4) (2010)

4. Assante, M., Candela, L., Castelli, D., Manghi, P., Pagano, P.: Science 2.0 repositories: Time for a change in scholarly communication. D-Lib Magazine 21(1) (2015)

5. Ball, A.: Tools for research data management. Tech. rep., University of Bath, Bath, UK (2012)

6. Bankier, J.: Institutional repository software comparison. UNESCO Communication and Information 33 (2014)

7. Borgman, C.L.: The conundrum of sharing research data. Journal of the American Society for Information Science and Technology 63(6), 1059–1078 (2012)

8. Burns, C.S., Lana, A., Budd, J.: Institutional repositories: exploration of costs and value. D-Lib Magazine 19(1), 1 (2013)

9. Candela, L., Castelli, D., Manghi, P., Tani, A.: Data journals: A survey. International Review of Research in Open and Distance Learning 66, 1747–1762 (2015)

10. Coles, S.J., Frey, J.G., Bird, C.L., Whitby, R.J., Day, A.E.: First steps towards semantic descriptions of electronic laboratory notebook records. Journal of Cheminformatics 5, 1–10 (2013)

11. Corti, L., Van den Eynden, V., Bishop, L., Woollard, M.: Managing and sharing research data: A guide to good practice. Records Management Journal 24(3), 252–253 (2014)

12. Council of the Consultative Committee for Space Data Systems: Reference Model for an Open Archival Information System (OAIS). Tech. Rep. January (2002)

13. Devarakonda, R., Palanisamy, G.: Data sharing and retrieval using OAI-PMH. Earth Science Informatics 4(1), 1–5 (2011)

14. European Commission: Guidelines on open access to scientific publications and research data in Horizon 2020. Tech. Rep. December (2013)

15. Van den Eynden, V., Corti, L., Bishop, L., Horton, L.: Managing and Sharing Data, 3rd edn. UK Data Archive, University of Essex (2011)

16. Fay, E.: Repository software comparison: building digital library infrastructure at LSE. Ariadne 64(2009), 1–11 (2010)

17. Green, A., Macdonald, S., Rice, R.: Policy-making for Research Data in Repositories: A Guide. London: JISC funded DISC-UK Share Project (2009)

18. Heidorn, P.: Shedding light on the dark data in the long tail of science. Library Trends 57(2), 280–299 (2008)

19. Hodson, S.: ADMIRAL: A Data Management Infrastructure for Research Activities in the Life sciences. Tech. rep., University of Oxford (2011)

20. Hoxha, J., Brahaj, A.: Open government data on the web: A semantic approach. In: International Conference on Emerging Intelligent Data and Web Technologies, pp. 107–113 (2011)

21. Kučera, J., Chlapek, D., Mynarz, J.: Czech CKAN repository as case study in public sector data cataloging. Systémová Integrace 19(2), 95–107 (2012)

22. Lagoze, C., Sompel, H.V.D., Nelson, M., Warner, S.: The Open Archives Initiative Protocol for Metadata Harvesting. Proceedings of the first ACM/IEEE-CS Joint Conference on Digital Libraries (2001)

23. Lynch, C.A.: Institutional repositories: essential infrastructure for scholarship in the digital age. portal: Libraries and the Academy 3(2), 327–336 (2003)

24. Lyon, L.: Dealing with Data: Roles, Rights, Responsibilities and Relationships. Tech. rep., UKOLN, University of Bath (2007)

25. McNutt, M.: Improving scientific communication. Science 342(6154), 13 (2013)

26. National Science Foundation: Grants.gov Application Guide: A Guide for Preparation and Submission of National Science Foundation Applications via Grants.gov. Tech. rep. (2011)

27. Piwowar, H.A., Vision, T.J.: Data reuse and the open data citation advantage. PeerJ 1, e175 (2013)

28. Poschen, M., Finch, J., Procter, R., Goff, M., McDerby, M., Collins, S., Besson, J., Beard, L., Grahame, T.: Development of a Pilot Data Management Infrastructure for Biomedical Researchers at University of Manchester – Approach, Findings, Challenges and Outlook of the MaDAM Project. International Journal of Digital Curation 7, 110–122 (2012)

29. Rafes, K., Germain, C.: A platform for scientific data sharing. In: BDA2015 Bases de Données Avancées (2015)

30. Ramalho, J.C., Ferreira, M., Faria, L., Castro, R., Barbedo, F., Corujo, L.: RODA and CRiB a service-oriented digital repository. iPres Conference Proceedings (2008)

31. Rocha da Silva, J., Barbosa, J., Gouveia, M., Correia Lopes, J., Ribeiro, C.: UPBox and DataNotes: a collaborative data management environment for the long tail of research data. iPres Conference Proceedings (2013)

32. Rocha da Silva, J., Castro, J.A., Ribeiro, C., Correia Lopes, J.: Dendro: collaborative research data management built on linked open data (2014)

33. Rocha da Silva, J., Ribeiro, C., Correia Lopes, J.: UPData – A Data Curation Experiment at U.Porto using DSpace. In: iPres Conference Proceedings, pp. 224–227 (2011)

34. Rocha da Silva, J., Ribeiro, C., Correia Lopes, J.: Managing multidisciplinary research data: Extending DSpace to enable long-term preservation of tabular datasets. In: iPres Conference Proceedings, pp. 105–108 (2012)

35. Rocha da Silva, J., Ribeiro, C., Correia Lopes, J.: Ontology-based multi-domain metadata for research data management using triple stores. In: Proceedings of the 18th International Database Engineering & Applications Symposium (2014)

36. Rocha da Silva, J., Ribeiro, C., Correia Lopes, J.: The Dendro research data management platform: Applying ontologies to long-term preservation in a collaborative environment. iPres Conference Proceedings (2014)

37. Willis, C., Greenberg, J., White, H.: Analysis and Synthesis of Metadata Goals for Scientific Data. Journal of the Association for Information Science and Technology 63(8), 1505–1520 (2012)

38. Winn, J.: Open data and the academy: An evaluation of CKAN for research data management. International Association for Social Science Information Services and Technology (2013)
