DHARMa was a one-year project
addressing effective digital research
data preservation in the Humanities. It
was funded by the John Fell Fund and
managed by the Bodleian Digital
Library.
DHARMa Project Final Report November 2014
J. McKnight; C. Madsen; J. Prag
Table of Contents
Introduction
    The projects
    Context
    The problem of preservation
    Strategic positioning
    Focus on the Humanities
Assessment of existing tools and services
    ORA-Data
        Depositing data
        Current and future work
    Other software
        DataStage
        ORDS (Online Research Database Service)
        BEAM Web Deposit
        External repositories
        3rd-party software
    Service providers
        Research Data Management single point of contact
        Research Services / Research Accounts (UAS)
        Bodleian Libraries Digital Library Systems and Services (BDLSS)
        Academic IT Research Support team (IT Services)
        Infodev (IT Services)
        IT Learning Programme (IT Services)
        Subject Librarians
        BEAM (Bodleian Electronic Archives and Manuscripts)
        Faculty and Department ITSS
Methods & Activities
    Semi-structured interviews with researchers
    Conversations with IT Support Staff
    Data acquisition
    Data ingest and metadata creation
Case studies: research projects
    Sphakia Survey
        Overview
        Scope of research
        Research materials
Common findings
    Finding 1: Data v interfaces
    Finding 2: Funding gaps for sustainability and preservation
    Finding 3: Lack of reuse and evolution
    Finding 4: Advice, training, and mentoring
    Finding 5: Repository policy on data and formats
    Finding 6: ORA-Data API
Results
    Datasets in ORA-Data
    Improved digital preservation guidelines
Conclusions
    Recommendations
        1. Preservation and sustainability
        2. Funding gaps for sustainability and preservation
        3. Reuse and evolution
        4. Advice, training, and mentoring
        5. Repository policy on data and formats
        6. ORA-Data API
    Next steps
        1. Information
        2. Innovation
        3. Investment
Introduction
DHARMa (Digital Humanities Archives for Research Materials) was a one-year project (August
2013 to August 2014), funded by the John Fell Fund. The project worked closely with 13 self-
selected Digital Humanities projects to investigate the nature of the research data they are using
and creating; to understand the preservation requirements arising from these; to pilot data
acquisition and ingest in ORA-Data; and to use the findings to inform the central provision of
data preservation services.
The projects
● Ashmolean Cyprus Digitisation Project (Anja Ulbrich)
● Centre for the Study of the Cantigas de Santa Maria (Stephen Parkinson)
● Creative Practice in Contemporary Concert Music (Eric Clarke / Mark Doffman)
● Dictionary of Medieval Latin from British Sources (Tobias Reinhardt / Richard
Ashdowne)
● Digital Miscellanies Index (Abigail Williams)
● Early Modern Festival Books (Helen Watanabe-O’Kelly)
● First World War Poetry Digital Archive & The Great War Archive (Stuart Lee / Katharine
Lindsay)
● Inscriptions of Sicily (Jonathan Prag)
● Last Statues of Antiquity (Bryan Ward-Perkins & Bert Smith)
● Lexicon of Greek Personal Names
● Oxford Archive of Russian Life History (Catriona Kelly)
● Oxford Roman Economy Project (Alan Bowman)
● Sphakia Survey (Lucia Nixon)
Context
Since the early 1970s Oxford has been at the cutting edge of the Digital Humanities (broadly
defined as the application of advanced digital technologies to humanities research). The
Digital.Humanities@Oxford website currently lists over 200 projects in this area, in which Oxford
has secured more grant awards than any other UK institution (see
http://digital.humanities.ox.ac.uk/SubjectAreas/subject_areas.aspx). The projects involve
leading academics across the Humanities disciplines (and beyond, in increasingly
interdisciplinary collaborations) as well as staff in the University’s IT Services, the Oxford e-
Research Centre (OeRC), the Bodleian Libraries, Museums, and Colleges. The data produced
by these projects constitutes a major corpus of research resources with the power to transform
the work of humanities scholars.
The problem of preservation
Digital preservation is the active management of digital content over time to ensure ongoing
access.[1] Without appropriate preservation, the digital research data being created within
humanities projects may be lost to the research community. Digital preservation is a particular
issue in the humanities, where the scholarly benefits of research outputs necessarily accrue
over the medium- to long-term. Grant-giving bodies accordingly require firm guarantees for the
sustainability and preservation of the resources that they fund. The AHRC, for example, has
recently reviewed and strengthened its demands in this respect: it now includes a requirement
for applicants producing digital resources to complete a Technical Plan specifying exactly how,
and for how long, data outputs will be sustained and preserved. Other funders are making
similar requirements, with attention also turning toward ensuring maximum ‘value for money.’[2]
Preservation involves resource commitments beyond the end of any project’s funding period,
and in the humanities these costs are not normally covered by grant-awarding bodies. It is
therefore essential to consider a strategic and wide-ranging approach to how these costs can be
minimised, and the limited resources used in the most effective way possible.
The decentralised nature of Oxford makes preservation a particular challenge compared to
other institutions. In Oxford, some centralised services for research data creation, management,
and preservation are under development, but digital resources may be built and/or hosted by
any number of organisations or individuals including the Bodleian, IT Services, colleges, and
departments, with the latter two particularly struggling to meet long-term preservation goals. To
organise preservation services efficiently – and demonstrate this to funders and donors – will be
a significant strength in Oxford’s capacity to bid for funding for digital projects. However, it
cannot be stressed enough that this is not simply a financial consideration, but rather a
fundamental necessity for the realisation of the potential of digital research in the humanities.
Without efficient and effective preservation, the future of humanities research is at risk. Libraries
and archives are filled with humanities research data of the past, but the raw materials of
research of the future are very much endangered.
Strategic positioning
With more and more digital data being created, and funders increasing their requirements for
effective preservation, Oxford is already moving towards meeting the growing challenges of
research data management (RDM) and preservation. Several strategy documents recognise
RDM as vitally important:
[1] See http://www.digitalpreservation.gov/about/
[2] EPSRC funding from May 2015 will require researchers to show that data will be available for 10 years from last access.
● The University of Oxford Strategic Plan 2013-18[3] names two key priorities, “global reach”
and “networking, communication, and interdisciplinarity”, both of which rely on effective
management and dissemination of research data (the section on IT Infrastructure further
develops this connection)
● One of the key objectives of the IT Strategic Plan[4] is “To provide the infrastructure and
tools to allow researchers to be compliant with regulatory requirements to preserve and
share electronic research outputs.”
● The Bodleian Libraries’ Strategic Plan emphasises the importance of digital initiatives,
and a more detailed digital strategy (‘The Digital Shift’) is being developed[5] which
stresses that preservation must be the bedrock of any successful digital strategy.
● Oxford has released a Policy on the Management of Research Data and Records[6] which
highlights the value of research data “for research, teaching, and for wider exploitation
for the public good” as well as acknowledging the need for compliance with funder
requirements.
Against a background of strategy and policy creation, the infrastructure is being developed to
meet the growing requirements for research data management and preservation; however, this
work is not sufficiently resourced, so development has been incremental and reactive.
Focus on the Humanities
This project focused specifically on the humanities partly because of a perception that the
discourse around research data management and preservation was strongly focused on the
needs of the sciences and social sciences, and that the humanities may have different
requirements which were in danger of being sidelined. In practice, the project has emphasised
that while there are common principles and practices which apply to all research materials
across the disciplines, there are also certain characteristics of humanities data which should be
taken into account when developing strategies for supporting and facilitating its preservation
and management. None of these issues, of course, are found exclusively in humanities
research; the social sciences may be regarded as the ‘nearest neighbour’ of the humanities in
these respects, and shared strategies should be explored, but nonetheless these are strong
trends within the humanities disciplines:
[3] http://www.ox.ac.uk/media/global/wwwoxacuk/localsites/gazette/documents/supplements2012-13/University_of_Oxford_Strategic_Plan_2013-2018_(1)_to_No_5025.pdf
[4] http://www.it.ox.ac.uk/about/itstrategy/itstrategicplan/
[5] http://www.bodleian.ox.ac.uk/bodley/news/2014-mar-3
[6] http://researchdata.ox.ac.uk/files/2014/01/Policy_on_the_Management_of_Research_Data_and_Records.pdf
● data is often captured from other sources rather than being created from scratch, and
may be derived from a working subset of a larger corpus; both of these things may have
ramifications for data conversion, determining copyright, and recording provenance.
● data is often qualitative, textual, and/or narrative, which can make it harder to
standardise and structure than purely quantitative data
● research outputs often comprise heterogeneous collections of data types, making it even
more important to a) standardise formats as far as possible and b) include clear and
consistent metadata for describing the relationships between components of datasets
● data which has been collected is often recombined and restructured to generate new
patterns and allow new questions to be posed; it is often the complex structures and
interrelations between the datasets which provide the value, so it is vital that these are
preserved meaningfully
Humanities research projects also have commonalities of structure which will have implications
for data preservation strategies:
● as mentioned above, the value of humanities research tends to lie in the medium- to
long-term; understanding and documenting the context of the original data
capture/creation as early as possible is extremely important if its full value is to be
realised, as this information may be impossible to reconstruct by the time the data
comes to be used
● projects are often long-running (the longest studied in this project celebrated its
hundredth year in 2014), meaning that the technology used may well change several
times throughout the life of the project. If the project responds to this change then the
resulting data may have been through several conversions by the time it comes to be
ingested for preservation; if it does not, the resulting data may be in obsolete formats
which are difficult to understand and convert.
Another consideration is that there are very few national or subject-specific repositories
available for humanities data; one notable exception is the Archaeology Data Service
(http://archaeologydataservice.ac.uk/), and some humanities research materials may be eligible
for the UK Data Archive (http://www.data-archive.ac.uk/). Given this, Oxford’s institutional
repository services may be of much higher significance to researchers in the humanities than to
those in many of the sciences. A more detailed analysis of differences between research
practices across divisions has been conducted by the DaMaRo project
(http://damaro.oucs.ox.ac.uk/outputs.xml).
Assessment of existing tools and services
We investigated and tested (where possible) all existing tools and services that seemed to be
relevant to the processes of research data management and ingest for preservation. Some of
these are described below, but a full Glossary of RDM/Preservation Projects and Services is
also provided in Appendix 2.
ORA-Data
http://databank.ox.ac.uk/
ORA-Data is the institutional research data repository being developed by the Bodleian
Libraries. Formerly known as DataBank, it has been rebranded to show more explicitly that it is
part of the Oxford University Research Archive (ORA), which to date has contained only articles
and other ‘book-like’ research outputs. ORA-Data is live and contains data, but is still in a pilot
phase. It should be noted that all technical development of data services so far has been on
‘soft’ externally funded projects. Current work (summer 2014) is being supported by a
combination of the Bodleian Libraries and the RCUK OA Project (where there is overlap with
publications). There has been a small amount of funding (1 FTE) for staff to support a
mainstream service.
ORA-Data is suitable for research data in any discipline and can accept data in any format for
deposit. Each dataset has a metadata record describing the dataset, and Digital Object
Identifiers (DOIs) are assigned to all data packages deposited in ORA-Data.[7] Datasets can
either be made publicly available or embargoed for as long as necessary. ORA-Data’s other key
purpose is its role as a data catalogue for the University. The aim is to hold and display records
of data held in a location other than ORA-Data, be that within Oxford or in an external subject or
other data archive.
Depositing data
Deposit can currently only be undertaken by Bodleian staff, although an online deposit interface
is in development and is due to be released in Autumn 2014 (see below). Broadly speaking,
data can currently be ingested into ORA-Data in two ways:
[7] A digital object identifier is a string of characters which uniquely identifies an object (e.g. a dataset or a document). Metadata about the object is stored in association with the DOI and may include a location (e.g. a URL) where the object can be found; the DOI remains stable over the object’s lifetime, while the location and other metadata may change.
1. Manual, via web interface
A data package is created within a silo:
At this stage a minimal set of default metadata is created, following the DataCite[8] core
(identifier, mediator, licence, embargo, rights, version, publisher, creation date). More metadata
can be added manually to the RDF manifest:
[8] http://www.datacite.org/
and a file or files can be uploaded (this can include a zip file, which will then automatically be
unzipped in the package):
There are a number of problems and limitations with this method as it stands:
● manual metadata input increases the likelihood of error and inconsistency
● many users (including ITSS) will not have the resources to construct RDF by hand
● it is not possible to replace or delete metadata
● it is not possible to add separate metadata for individual files within a dataset (and
manually creating a separate package for each component is impractical for all but the
smallest datasets)
● the interface is not user-friendly; very little help or documentation is provided
However these issues are being addressed (see current and future work, below).
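One way to reduce the burden of constructing RDF by hand would be to generate the manifest from a simple set of field values. The following Python sketch is illustrative only. It does not reproduce ORA-Data's actual manifest format: the mapping of the default fields listed above to Dublin Core terms, the `http://example.org/vocab/` namespace used for embargo and version, and the function name `build_manifest` are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DCTERMS_NS = "http://purl.org/dc/terms/"
# Hypothetical namespace for fields with no obvious Dublin Core term.
LOCAL_NS = "http://example.org/vocab/"

ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("dcterms", DCTERMS_NS)
ET.register_namespace("ex", LOCAL_NS)

# Assumed mapping from the default fields to RDF properties.
FIELD_TO_PROPERTY = {
    "identifier": (DCTERMS_NS, "identifier"),
    "mediator": (DCTERMS_NS, "mediator"),
    "licence": (DCTERMS_NS, "license"),
    "rights": (DCTERMS_NS, "rights"),
    "publisher": (DCTERMS_NS, "publisher"),
    "creation date": (DCTERMS_NS, "created"),
    "embargo": (LOCAL_NS, "embargoedUntil"),  # hypothetical property
    "version": (LOCAL_NS, "version"),         # hypothetical property
}


def build_manifest(package_uri, metadata):
    """Return an RDF/XML string describing one data package.

    Unknown field names are rejected so that typing errors surface
    immediately instead of producing an inconsistent manifest.
    """
    unknown = set(metadata) - set(FIELD_TO_PROPERTY)
    if unknown:
        raise ValueError(f"Unrecognised metadata fields: {sorted(unknown)}")

    root = ET.Element(f"{{{RDF_NS}}}RDF")
    description = ET.SubElement(
        root, f"{{{RDF_NS}}}Description", {f"{{{RDF_NS}}}about": package_uri}
    )
    # Sort the fields so the same input always yields the same output.
    for field in sorted(metadata):
        namespace, local_name = FIELD_TO_PROPERTY[field]
        prop = ET.SubElement(description, f"{{{namespace}}}{local_name}")
        prop.text = str(metadata[field])
    return ET.tostring(root, encoding="unicode")
```

Calling `build_manifest` with a package URI and a dictionary such as `{"identifier": "demo", "licence": "CC BY 4.0"}` returns a manifest string, while a misspelt field name raises an error before anything is written.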
2. Programmatic, scripted
Databank (the software underlying ORA-Data) also has a REST API (documented at
https://databank.ora.ox.ac.uk/api/), and BDLSS developers have written a Python library making
use of this, which will form a useful basis for future software development. It should be noted
that while Databank itself is written in Python, the REST API is language-agnostic by its nature,
so interfaces could be developed using other languages.
If complex datasets are to have detailed metadata attached to their components at a level of
granularity lower than the dataset as a whole (and this is strongly recommended for effective
preservation and reusability) then these components will need to be ingested as separate but
linked packages, which will require programmatic ingest. This could work at a variety of levels of
complexity, e.g.
● bespoke scripts interacting directly with the Databank API
● customisable scripts for handling common forms of dataset
● a fully-featured interface allowing users (whether researchers or repository staff) to
organise a dataset as if in a standard filesystem, then export this structure as data
packages with associated metadata (see DataStage)
The first two of these would probably be written and/or configured by BDLSS staff or ITSS; the
third would make it more feasible for users to deposit their own data (though could also be used
by library/IT staff).
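The idea of ingesting components as separate but linked packages can be illustrated with a short planning step that runs before any call to the repository. The Python sketch below is a hypothetical example, not part of the BDLSS library or the Databank API. It assumes that each top-level directory of a dataset becomes one child package, and it records the links using the names `isPartOf` and `hasPart` purely for illustration. The function name `plan_packages` and the `_root` label for loose files are likewise invented.

```python
from pathlib import PurePosixPath


def plan_packages(dataset_id, file_paths):
    """Split a dataset's files into one parent and several child packages.

    Files are grouped by their top-level directory; files that sit at the
    top level are collected in a child package labelled "_root".
    """
    groups = {}
    for path in file_paths:
        parts = PurePosixPath(path).parts
        component = parts[0] if len(parts) > 1 else "_root"
        groups.setdefault(component, []).append(path)

    # The parent holds no files itself; it only links to its components.
    parent = {"id": dataset_id, "files": [], "isPartOf": None, "hasPart": []}
    children = []
    for component in sorted(groups):
        child_id = f"{dataset_id}-{component}"
        parent["hasPart"].append(child_id)
        children.append(
            {
                "id": child_id,
                "files": sorted(groups[component]),
                "isPartOf": dataset_id,
                "hasPart": [],
            }
        )
    return [parent] + children
```

A bespoke or customisable ingest script could then walk the returned list, creating each package and attaching component-level metadata. How those calls are made depends on the repository API and is left out here.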
At the time of writing there were few real examples of programmatic ingest into ORA-Data. A
script was written to ingest data from the Great War Archive, but the code was written by a
contractor and has since been lost.
Accessing data
The web interface to ORA-Data allows for some browsing and searching of data packages:
The screenshot above shows the data packages from the Great War Archive as they appear in
ORA-Data. Such a view of the data emphasises the difference between data preservation and
the websites that are often built by (or for) researchers to access research data. The data
packages are more suited to machines than to humans: while the metadata is fully indexed,
there are no ‘browse’ features such as those on the project website
(http://www.oucs.ox.ac.uk/ww1lit/gwa).
Current and future work
Work on ORA-Data is ongoing and is currently directed towards:
● a clear and simple user interface for deposit and access to data, with a new facility for
creating and editing rich metadata
● implementation of single sign-on (SSO) authentication for depositors
● developing the API for accessing the data (see also Common findings: ORA-Data API)
● transitioning to a robust, clearly defined service with a transparent charging model and
procedures to manage payment
● allowing ORA to serve as a catalogue and search interface to Oxford-generated
research, whether it is archived in ORA or elsewhere. This means that metadata-only
records will be created (whether added manually or automatically harvested) for
externally-stored research materials, and these will include a link to the location of the
dataset.
● allowing direct deposit from research data creation/management software, so that
deposit can be more tightly integrated into research workflows
● legal framework including terms and conditions and deposit licence
● policies to underpin the service
● quantifying the staff required to run a mainstream service [recruitment dependent on
adequate resourcing]
● helpdesk services, including an RT ticketing system staffed by BDLSS staff
● supporting documentation and ‘how to’ guides for depositors and users
● an implementation plan to steer development towards ‘EPSRC readiness’ for 1 May
2015
● scalable infrastructure (dependent on adequate resourcing)
A workflow has been developed for ingest into ORA-Data[9]; this is intended for Bodleian staff,
but the (possibly optimistic) expectation is that researchers will eventually be able to do their
own ingest (or delegate it to e.g. research assistants).
Work is currently under way to add a more user-friendly form for submitting a record and
uploading data with a choice of either ‘rich’ or ‘simple’ metadata. Wireframes have been created
(an example screenshot is shown below) and these have been trialled with researchers from
two of the selected projects, providing useful feedback in both cases.
[9] See Appendix 4: Workflow for uploading to DataBank as of 26/08/2014
Work is also under way to provide researchers with ways to deposit directly to ORA-Data from
the software they use to create and manage their research data, with ORDS (Online Research
Database Service) and DataStage both being steps towards this goal. However at the time of
writing neither is yet fully operational as an institutional service, and archival data export from
ORDS is still wholly manual. IT Services and the Bodleian Libraries are currently planning a joint
project to work on research data deposit from other systems, of which DataStage is a possible
example.
Other software
DataStage
http://www.dataflow.ox.ac.uk/index.php/about/about-datastage
DataStage is software developed as part of the JISC-funded Dataflow project
(http://www.dataflow.ox.ac.uk/) to help researchers manage their ‘active’ digital research data
prior to publication or archiving. It is a means for researchers to deposit selected data into their
data repository of choice (providing that repository complies with the SWORD2 standard).
DataStage is a secure personalised ‘local’ file management environment for use at the research
group or individual level, appearing as a mapped drive on the researcher’s computer. It can be
deployed on a local server, or on an institutional or commercial cloud.
Users save files to DataStage just as they would to an ordinary desktop drive (e.g. to their C:
drive), but with added extras:
● Private, shared and collaborative directories, with password-controlled access
● Web access – work securely with stored files over the web, anywhere in the world
● Users can add richer metadata via the web interface, using free-text ‘notes’ fields
● All files can be automatically backed up via the usual backup service
● Users can invite colleagues to access files made available to a defined group via
password control
● Repository submission interface makes it easy for researchers to define data packages,
enter minimal metadata, and deposit them in a data archive of choice
● Flexibility to dynamically invoke additional cloud storage as required
Current and future work
IT Services are investigating the feasibility of developing a central multi-tenanted DataStage
instance; a local instance has already been deployed in Zoology, and Classics are considering
another. DataStage is also being used with ORA-Data in DigitalSafe
(http://digitalsafe.wordpress.com/), an electronic archive pilot project at Oxford. None of the
DataStage projects to date have gone beyond a pilot stage, however, and its future is unclear.
ORDS (Online Research Database Service)
http://ords.ox.ac.uk/
ORDS is a hosted database service, accessed over the web, which allows researchers at the
University of Oxford (and their external collaborators) to create/import databases,
add/edit/search the data, publish datasets, and (eventually) deposit data into ORA-Data. It
allows multiple editors to work on a single database; access controls allow users to decide who
gets permission to do what.
Current and future work
ORDS is currently being rolled out to early adopters, but work is still needed in the following
areas: improvements to the user interface and documentation; handling complex structures and
different import formats; the ability to deposit directly to ORA-Data (this is currently a wholly
manual process); transitioning to a robust, clearly defined service with a transparent charging
model.
BEAM Web Deposit
http://beamwebdeposit.sourceforge.net/
BEAM (Bodleian Electronic Archives & Manuscripts: http://www.bodleian.ox.ac.uk/beam) has a
web service which allows depositors of electronic archival material to upload datasets of up to
4 GB to a dedicated silo of ORA-Data for archiving. It offers a simple upload (for
single files, or compound files such as zips and tars) as well as a Java applet which allows
users to edit their upload as they go, for example to exclude certain files or directories. Users
are asked to supply a very small amount of metadata, and both they and the repository receive
an email receipt to confirm what was transferred. Administrative functions provided for BEAM
include: user management; the ability to alter settings for the user receipt issued on deposit; and
the ability to view details for transfers in the BEAM landing zone which are ready for transfer to
the BEAM preservation store.
Current and future work
This service has been out of action since a server crash in November 2013; there are
plans to restore it, but no timescale or resources have yet been allocated.
External repositories
The University actively encourages researchers to deposit their data in external discipline-
specific repositories where appropriate ones exist in their field; metadata records can be created
for these in ORA-Data (either by harvesting or by manual submission) so that the data will
appear in local catalogues of Oxford University research materials and then link to the data
source for access. The initial focus for archiving datasets in ORA-Data is on data that
1. underpins publications, and/or
2. has to be archived as a requirement of a funding grant, and
3. has no other suitable archive for deposit.
It should be noted, however, that there are very few national subject repositories for humanities
subjects; the main exception is the Archaeology Data Service
(http://archaeologydataservice.ac.uk/), but obviously this only serves one specific area of
humanities research. The UK Data Service (http://ukdataservice.ac.uk/) may also be appropriate
for some humanities data, though none of the projects we worked with would fall into this
category.
3rd-party software
Many departments, faculties, and individual research groups and projects have already
built/commissioned (or are planning) their own systems and services for data management,
creation, and preservation, built on third-party software with varying degrees of local
configuration and customisation. Some examples:
● Archaeology have a rapid prototyping system for database-driven research
websites/applications, built on Filemaker Pro Server;
● IT Services, Bodleian Libraries Digital Library Systems and Services (BDLSS) and
departmental ITSS have each built bespoke research databases and collections for
many projects.
We recommend that central services (the Bodleian, IT Services) should keep lines of
communication with ITSS (and with each other) open, learn from their experience, and
interoperate with existing local systems where it is useful and possible to do so.
Other commonly used tools
Researchers are of course not limited to the software offered by the University, and will often
use the tools they have to hand, particularly in the early stages of a project where the
format/scope of the data may not yet be clearly defined. It is important that we support them in
this freedom while advising on general principles of effective data management with long-term
preservation in mind, including risks they should be aware of and steps they can take to ensure
that their data remain within their control, ‘portable’, accessible, and reusable.
Examples of software used by researchers and ITSS to create, edit, and share research data:
● database software (e.g. Access, MySQL, Postgres, Filemaker Pro)
● spreadsheets (e.g. Excel, OpenOffice)
● word-processing (e.g. Microsoft Word, Google Docs, OpenOffice)
● file-sharing (e.g. Dropbox, Google Docs)
● web content management systems (e.g. Drupal)
None of these are specific to research data, but it is important that our general support of
third-party software is joined up with specific support concerning how that software is used in a
research context. Two risks arise from this situation: first, that without sufficient or suitable
central resource, people will invest money in third-party software and/or invent their own
solutions, which may be unsuitable, unscalable, or incompatible; second, that without sufficient
communication across the various sections of the University, practices and developments will
continue to diverge despite the existence of central resources.
Other projects in development
A full investigation of all relevant third-party software would be impossible, and was outside the
scope of this project; however we did work with IT Services and OSS Watch to conduct an
investigation into the suitability of CKAN (http://ckan.org/: an open-source data management
system which has been adopted by a number of UK HE institutions) for RDM at Oxford. This
identified a number of functions which we do not currently provide to researchers, such as:
● Preview simple data in the research data repository
● Allow simple HTML publishing alongside archived data
● Persistently address data at arbitrary levels of granularity for citation
● Easily link research data with research information
● Provide a personalised and customisable ‘presence’ for individuals and research groups
● Create & visualise explicit links between files in a dataset
● ‘Deep search’ within datasets (not just searching metadata)
● Simple way to generate citation text
Following on from this investigation, IT Services are bidding for 2.5 years’ funding to set up a
CKAN service. Further details of both the investigation and the funding bid are provided in
Appendix 6: investigation of CKAN for RDM at Oxford.
Service providers
Research Data Management single point of contact
http://researchdata.ox.ac.uk/
This year saw the launch of a new website, http://researchdata.ox.ac.uk/, intended to support
researchers in sharing, managing, and preserving their data and research materials. The site
aims to provide a starting point for answers to questions about data storage, organisation, and
preservation; funder requirements; and available training; and to aggregate useful links to
external resources such as sample technical plans, checklists, funder requirements, etc. It also
publicises the new ‘single point of contact’ email address for an ‘RDM
Enquiries team’ comprising representatives from the Bodleian Libraries, e-Research Centre, IT
Services and Research Services. This ‘team’ is entirely virtual: members are not co-located but
co-ordinate their responses online.
Research Services / Research Accounts (UAS)
http://www.admin.ox.ac.uk/researchsupport/
Research Services is the central research administration support service, providing
comprehensive support to researchers across the research lifecycle. Their services include
supporting the grant process, negotiating research-related contracts and agreements, providing
information on funding opportunities, helping to ensure compliance with regulatory and sponsor
requirements, facilitating technology transfer, and supporting the University’s knowledge exchange
activities.
Bodleian Libraries Digital Library Systems and Services (BDLSS)
http://www.bodleian.ox.ac.uk/bdlss/
Bodleian Libraries Digital Library Systems and Services (BDLSS) works in collaboration with scholars and
librarians at Oxford and around the world on the most pressing issues facing researchers in the
digital age, including digital preservation and the capture and delivery of digital research data;
it also has an active digitisation programme that aims to make some of the Bodleian’s rare
materials available globally for learning, teaching, and research. As well as managing ORA and
ORA-Data, BDLSS provides hosting, development and maintenance for many bespoke
research websites and collections.
Academic IT Research Support team (IT Services)
http://blogs.it.ox.ac.uk/acit-rs-team/
The Research Support team provides IT-related advice to researchers (e.g. assistance with
technical appendices for funding bids; developing data management plans; advising on suitable
software or storage solutions for research), undertakes technical development work, and
provides a variety of training events in the area of research data management. The Research
Support team offers an initial consultation free of charge; other services are charged at a set
rate.
InfoDev (IT Services)
http://www.it.ox.ac.uk/infodev/
InfoDev is IT Services' academic development team. It provides IT support, development and
consultation to help facilitate and disseminate research, support teaching and learning, and
increase access to museums and collections through the development and hosting (via NSMS)
of web applications; it also partners with academic research projects on funding bids (providing
guaranteed staff time, consultation, or undertaking specific work-packages) and offers help and
assistance in writing technical appendices to such bids. InfoDev offers an initial consultation free
of charge; other services are charged at a set rate.
IT Learning Programme (IT Services)
http://www.it.ox.ac.uk/itlp/
The IT Learning Programme offers training courses on a wide variety of IT topics, including topics
and methods directly relevant to RDM and the digital humanities (e.g. database design,
reference management systems, surveys), as well as co-ordinating the more specific RDM
training offered by Academic IT’s Research Support team.
Subject Librarians
Subject librarians are sometimes approached with questions about RDM and preservation; in
recognition of this, the Bodleian has created the position of Data Librarian in the Social
Sciences (the holder of this post also acts as a subject librarian for Sociology, Economics and
Social Policy & Intervention). The Bodleian Data Priority Group, comprising subject librarians,
members of BDLSS and others, has been convened to ensure subject specialists are informed
of data developments and to act as a point of contact between researchers and BDLSS.
BEAM (Bodleian Electronic Archives and Manuscripts)
BEAM provides a trusted digital repository service for the management of born-digital archives
and manuscripts acquired by the Bodleian Library’s Special Collections department. The
repository allows the Bodleian’s archivists to gather, describe, manage and preserve the digital
components in archive and manuscript collections while maintaining their relationship with more
traditional components of the same collection. While there is considerable overlap here with the
issues involved in archiving research data, BEAM does not currently provide a service to
academic departments or directly to researchers.
BEAM staff have been involved in the Digital Safe project (see Glossary), again looking at
similar issues but in the context of institutional data.
Faculty and Department ITSS
The faculties and departments that make up the Humanities Division have widely varying levels
of IT provision.
Methods & Activities
Semi-structured interviews with researchers
At the time of the funding application, an open call invited Digital Humanities projects to
participate in this project. Fourteen projects volunteered, of which
thirteen were eventually able to take part. Within each of the self-selected projects (see
Introduction: The projects), we conducted interviews with project researchers (generally the PI
on each of the projects) to determine:
● the aims and scope of their research
● their data creation or collection methods
● their data management practices as a whole, including organisation and documentation
of data, hosting, storage, data entry or editing, preservation plans, backups, etc
● the current state of their data in a wider research context, including citations, related
publications, collaboration with other projects, reuse, etc
● their views on whether or how the University was meeting their RDM requirements
● their understanding of digital preservation and sustainability, both generally and in the
context of their project
We took a semi-structured approach to these interviews rather than administering a fixed
questionnaire or survey, and the resulting conversations were invaluable in foregrounding the
real issues and challenges faced by digital humanities academics. A full description of each
project and the data gathered can be found in Appendix 1: project case studies.
Conversations with IT Support Staff
We also conducted discussions with several of the IT Support Staff (ITSS) from the academic
departments involved in the selected projects, who had worked to support these research
projects in some capacity. These were similar to the conversations with the researchers but with
a more technical focus. Again, allowing IT staff to talk freely proved rewarding, and helped to
build up a rich picture of the wide variety of experience and knowledge that local support staff
are currently providing.
The level and type of local (departmental) IT support varies widely between units, but services
provided to researchers by ITSS usually include some or all of the following:
● website/database development
● website/database hosting
● managing backups of desktop computers and project websites
● supplying, configuring, and maintaining third-party software
● advising on technical components of research funding bids
There is considerable overlap here with the services provided centrally by IT Services and the
Bodleian. This is not in itself a problem – local expertise can be valuable, and in any case
central resources are insufficient to meet all demand – but highlights the need for
communication between providers to ensure interoperability and long-term sustainability.
Despite the variety of experience, some common ground emerged:
● Most ITSS were dealing with several research projects at any one time, at various
stages of the research lifecycle (some ‘live’, i.e. current funded projects; some ‘legacy’,
usually only maintained on a best-effort basis).
● Faculty ITSS were generally deeply knowledgeable about the projects for which they
were responsible, with both an understanding of the technologies used (even in legacy
projects which they had only ‘inherited’) and an appreciation of the research aims.
● Despite that sense of investment in the research, ITSS inevitably had different priorities
from researchers, and in many cases we sensed a tension between the desire for
standardisation (on the basis that a reliable service can be more easily and efficiently
provided by supporting only a limited set of technologies) and an appreciation of the
importance of meeting individual researchers’ requirements (often by building and/or
maintaining bespoke systems).
● A frequently recurring issue was the lack of consultation about technical requirements for
research projects; ITSS reported that they were often only informed of requirements
once a project had already been funded (i.e. once they had effectively been committed
to providing hosting and/or development resources). In one extreme case the first ITSS
involvement or indeed knowledge of a project was a request for a firewall exception from
a contractor who was developing the project website.
Data acquisition
One of the initial goals of this project was to gather research materials from the self-selected
projects for ingest into the central repository (ORA-Data). These datasets comprised a variety of
file formats:
● Databases: MySQL; PostgreSQL; Access; Filemaker Pro
● XML texts and metadata: TEI; EpiDoc; bespoke schemas
● Images: TIFF; JPEG
● Video: MPEG; MP4
● Audio: MP3
● Word processing: .docx; .doc; WordPerfect
● PDFs
Several projects had physical media and non-digital assets associated with them, such as:
● CDs, DVDs
● audio recordings on analog cassettes
● paper transcripts of interviews
● slides, microfilm, microfiche
● museum objects
● notebooks
● archaeological finds (pot sherds etc)
Archiving and cataloguing these assets fell outside the scope of this project, but their
existence has highlighted the need for facilities to record the existence and
location of related physical assets when ingesting digital assets into a repository. It also raises
the question of whether a clearer policy is needed on whether the humanities equivalent
of ‘lab notebooks’ and other pre-publication working data is in scope for the institutional
repository (and if not, where or whether these should be preserved).
The main obstacles to data acquisition were:
● the simple logistics of liaising with so many different researchers and ITSS
● the availability of data holders – both researchers and ITSS had higher priorities
● technical issues with exporting data in a useful format
The main obstacles to data deposit were:
● the repository software – a limited interface for ingest and creation of metadata
● some datasets turned out to be so complex that formatting and describing them for
preservation would be a project in its own right
As there was no user-facing process or service for uploading data into ORA-Data during the
course of this project, files were transferred to project staff by combinations of the following:
● Oxfile file-sharing service (https://oxfile.ox.ac.uk/)
● portable hard drive
● CD/DVD through internal mail
● standard file compression software (gzip, tar)
Note that none of these tools allow for verifying the integrity of the data, i.e. ensuring that the
data is transferred without truncation or corruption; verification was performed manually by
checking file sizes and number of files against the information supplied by the project PI. This
sufficed for the current proof-of-concept project, but future development of systems for
submitting research data should take this requirement into account.
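One way a future submission system might meet this requirement is a checksum manifest: the depositor records a cryptographic hash for every file before transfer, and the recipient recomputes the hashes on arrival. The following is a minimal Python sketch under that assumption. The function names and the manifest layout (a relative path mapped to a SHA-256 digest) are illustrative only and do not describe any existing ORA-Data feature.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of one file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_manifest(root, manifest):
    """Compare a received directory against the depositor's manifest.

    Returns three sorted lists: files missing on arrival, files whose
    content changed, and files present but not listed in the manifest.
    """
    received = build_manifest(root)
    missing = sorted(set(manifest) - set(received))
    unexpected = sorted(set(received) - set(manifest))
    changed = sorted(
        name for name in manifest
        if name in received and received[name] != manifest[name]
    )
    return missing, changed, unexpected
```

In this sketch the depositor would run `build_manifest` before sending the files and pass the result along with them; the recipient would run `verify_manifest` on the received copy, and three empty lists would indicate that nothing was lost, altered, or added in transit.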
Despite efforts, much less data than intended was gathered. The main barrier to data
acquisition was not technical but social and logistical. Where research materials were stored in
databases to which only local IT staff had access, or even on researchers’ personal laptops, the
process of extracting data required considerable pro bono effort from already over-committed IT
staff and researchers. In all cases personal contact was required to gather data, and in most
cases this became a protracted discussion to establish the answers to questions such as what
data were currently held, what needed to be preserved, whether embargoes were required,
where copyright resided, and in what format(s) data could be exported. It is expected that as a
Data Management Plan (DMP) becomes a requirement for funding, some of these questions will
be answered earlier in the project lifecycle; none of the projects consulted had a DMP.
Data ingest and metadata creation
Once gathered, the data was ingested to ORA-Data through the manual browser-based process
described above, creating only the minimal metadata demanded by this method. While manual
creation of further metadata would have been possible, it would have been labour-intensive,
error-prone, and almost certainly inconsistent with the richer metadata which the new input
forms being developed (see ORA-Data above) will facilitate.
We did, however, investigate the possibility of creating additional targeted metadata extracted
from the content of datasets, namely bibliographies and the temporal, prosopographic, linguistic,
and geographic extent of data. This was also, inevitably, a largely manual process. While there
is scope for partial automation, e.g. template scripts for processing different data formats,
manual configuration would probably always be required to determine the correct fields to use
for e.g. person, place, or date information, and the potential gains (cross-searching, comparing
multiple datasets, visualising the extent of data) could as efficiently be addressed by more
standardisation of formats in the repository and better indexing/search within datasets.
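The division of labour described here, a reusable script plus a small hand-written configuration for each dataset, could look something like the following Python sketch. It assumes a dataset has already been exported to CSV; the column names (`findspot`, `year`) and the `FIELD_ROLES` mapping are hypothetical and would have to be set manually for each dataset.

```python
import csv
import io

# Hand-written per-dataset configuration: which column plays which role.
# These column names are hypothetical examples.
FIELD_ROLES = {"place": "findspot", "date": "year"}


def summarise_extent(csv_text, roles):
    """Return the distinct places and the earliest/latest year in a CSV export."""
    places = set()
    years = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        place = (row.get(roles["place"]) or "").strip()
        if place:
            places.add(place)
        raw_year = (row.get(roles["date"]) or "").strip()
        try:
            years.append(int(raw_year))
        except ValueError:
            # Non-numeric or missing dates need human review; skip them here.
            continue
    return {
        "places": sorted(places),
        "earliest": min(years) if years else None,
        "latest": max(years) if years else None,
    }
```

Only the configuration changes between datasets, but choosing it correctly (and dealing with values the script skips) remains the manual step identified above.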
Documentation was ingested alongside data where possible, but this was often patchy and
never machine-processable; the latter is not a problem per se (human-readable documentation
will be necessary and useful for as long as humans are involved in the reuse of data for further
research!) but it might be useful to experiment with template systems and ‘toolkits’ for
encouraging more structured documentation of data as research progresses: e.g. for databases,
a pre-populated form where a brief description of each field can be entered; for XML, achieving
a similar goal by automatically inserting comment fields at key points in the schema; for multi-
format datasets, automatically documenting the file types and any folder hierarchy, leaving
space for human explanations of the significance of those properties.
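As an illustration of the last suggestion, the following Python sketch walks a dataset folder and produces a plain-text outline of its file types and folder hierarchy, with placeholders for a researcher to complete. It is a minimal sketch only: the function name, the output layout, and the `TODO` markers are assumptions made for this example, not part of any existing toolkit.

```python
from collections import Counter
from pathlib import Path


def documentation_skeleton(root):
    """Return a plain-text outline of a dataset folder for a human to annotate."""
    root = Path(root)
    files = sorted(p for p in root.rglob("*") if p.is_file())
    counts = Counter(p.suffix.lower() or "(no extension)" for p in files)

    lines = ["File types"]
    for suffix, count in sorted(counts.items()):
        lines.append(f"  {suffix}: {count} file(s). Significance: TODO")

    lines.append("Folder hierarchy")
    folders = sorted(p for p in root.rglob("*") if p.is_dir())
    for folder in folders:
        # Indent each folder by its depth below the dataset root.
        depth = len(folder.relative_to(root).parts)
        lines.append(f"{'  ' * depth}{folder.name}/  Purpose: TODO")
    return "\n".join(lines)
```

The machine-generated part records only what can be observed automatically; the explanation of why the files are organised as they are still has to come from the researcher.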
Case studies: research projects
A full description of each of the cases can be found in Appendix 1: project case studies, but in
presenting the findings in this report, we decided to focus on one particular case, which turned
out to exhibit a wide range of characteristic issues for digital humanities research data, and to
use this as a starting point from which to highlight the same issues found in other projects.
These points of overlap are highlighted in the blue boxes.
Sphakia Survey
Overview
● Website: http://sphakia.classics.ox.ac.uk/
● Summary: Databases of finds recorded in an archaeological survey of Crete
● Project lifetime: 1987–
● Formats: Filemaker Pro; JPG; TIFF; .mov; HTML
● Funding: Social Sciences and Humanities Research Council of Canada; Institute for
Aegean Prehistory (New York); Craven Committee, University of Oxford
Scope of research
The Sphakia Survey is an interdisciplinary archaeological project, begun in 1987, whose main
objective was to reconstruct the sequence of human activity in a remote and rugged part of
Crete (Greece) from the time that people arrived in the area, by ca 3000 BC, until the end of
Ottoman rule in AD 1900. The research covers three major epochs, Prehistoric, Graeco-Roman,
and Byzantine-Venetian-Turkish, and has involved the work of many people using
environmental, archaeological, documentary, and local information.
Four main types of information are recorded: the artefacts discovered in the surface survey;
the environmental data that record the context in which they were found;
texts and inscriptions; and oral/ethnographic data (notes from interviews; photos of sites
and surrounding area). The information is categorised by region and environmental
zone, and individual artefactual finds are classified by shape, material, colour and so on.
The project represents nearly 30 years of academic research; the archaeological survey is
inherently unrepeatable and the subsequent analysis would be at best costly and at worst
impossible to replicate.
Research materials
Databases
All of the information collected for the Sphakia Survey was recorded in a set of databases. The
survey databases were created in Filemaker Pro. One set of linked databases (focusing on
Region 8 of the survey, as a case study) is accessible via the website through a bespoke PHP
interface, running on the following setup:
● Hardware: PowerMac G3 server (warranty ended in 1999)
● OS: Mac OS X 10.3
● Webserver: 4D WebSTAR
● Database: Filemaker Pro version 5.5
This version of Filemaker Pro is long since out of support and has no direct upgrade path to a
current version; the hardware is long overdue for upgrade. As a whole the project’s web
presence is in urgent need of intervention to safeguard its long-term survival.
In addition to the online databases, 29 further databases (mostly concerning detailed
macroscopic fabric analysis of the finds from the survey) exist only offline in Filemaker Pro
version 4.1 format on the researchers’ personal computers. This version of Filemaker Pro, still
being used by the project team, requires Mac OS 9 (or ‘Classic’ environment to emulate this) to
run, which in turn requires a PPC Mac running OS X 10.5 or earlier (see note 10): this hardware has not
been supported by Apple since 2009. In order to export these databases into a reusable format
they would first have to be updated (via a two-stage migration process) to a modern version of
Filemaker Pro.
Unsupported software: also found in one other project
Server setup at risk: also found in one other project
There are at least three copies of these databases (on the PI’s laptop; on an external hard drive,
as a backup copy; and on the Co-PI’s computer) and as no version control is in place it is quite
likely that the copies are no longer in sync and would have to be intelligently merged. This
operation is complicated by the fact that Filemaker’s “last modified” timestamp is updated
whenever a database is queried, so programmatically determining the most recent substantive
changes may be difficult or even impossible.
10 Current Mac OS is 10.9
Version control issues: also found in one other project
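Where timestamps cannot be trusted, one alternative is to compare the copies by content. The following Python sketch assumes that each copy can first be exported to CSV with a stable record identifier column (called `id` here, a hypothetical name). It reports which records differ between two copies; deciding which version is correct would still require the researchers' judgement.

```python
import csv
import io


def load_records(csv_text, key="id"):
    """Index the rows of a CSV export by their record identifier."""
    return {row[key]: row for row in csv.DictReader(io.StringIO(csv_text))}


def compare_copies(copy_a, copy_b, key="id"):
    """Report records found in only one copy and records whose fields differ."""
    a = load_records(copy_a, key)
    b = load_records(copy_b, key)
    differing = {}
    for record_id in sorted(set(a) & set(b)):
        fields = sorted(
            name for name in set(a[record_id]) | set(b[record_id])
            if a[record_id].get(name) != b[record_id].get(name)
        )
        if fields:
            differing[record_id] = fields
    return {
        "only_in_a": sorted(set(a) - set(b)),
        "only_in_b": sorted(set(b) - set(a)),
        "differing": differing,
    }
```

A report of this kind narrows the merge to the records and fields that actually disagree, rather than relying on which file appears to have been modified most recently.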
Image & video
One database includes scanned black and white drawings of pottery finds (copyright status
uncertain: the originals are owned by the Canadian Institute in Greece, which was not
responsible for the digitisation and is considered unlikely to assert its rights); there is
also a large quantity of photographs (low resolution versions are mostly on the project website,
but higher resolution versions exist offline). Photographs have unique IDs and associated
captions/descriptions are recorded in the database. Video clips (.mov/.rm format) are available
through the website; these are taken from a 50-minute film, originally in VHS format, since
digitised.
Copyright status in images: also found in one other project
Different quality versions of images: also found in one other project
Website and documentation
Technical documentation for the website and databases is believed to exist but project staff
were not able to find or obtain it within the duration of the project.
The website provides detailed explanations of the project aims, a summary of results, how to
use the database, and so on. It is regarded by the researchers as a publication in its own right,
to be preserved alongside the data.
The website also hosts a number of scholarly publications (HTML format) and newspaper
articles (reproduced as JPG and PDF), as well as a bibliography of articles published
elsewhere.
Documentation embedded in website: also found in two other projects
Physical media
In addition to the digital data, the PI has custody of the following media associated with the
project:
● Original paper drawings of pottery finds
● 2 copies of VHS video (as described at http://sphakia.classics.ox.ac.uk/video.html)
● Approximately 45 CDs of digitised photos, slides, & drawings (in the form of TIFFs, JPEG
derivatives for the website, and thumbnails of these), plus course materials for
Archaeology for Amateurs: The Mysteries of Crete, a course created for TALL in 2002
(http://crete.classics.ox.ac.uk/)
● Approximately 1000 35mm slides (some of which were digitised for the website)
● 1 reel of microfilm
● 1 box of microfiche
Except for the cataloguing/captioning of photographs for the website, none of these resources
have been catalogued or documented.
Further non-digital resources
The pot sherds from the archaeological survey are currently housed in the Archaeological
Museum in Khania (West Crete), and are owned by the Greek Archaeological Service. The
site/number recorded in the database can be used to identify the physical object.
Handwritten notebooks from the survey also exist; these have not been catalogued or digitised.
A two-volume publication based on the finds of the survey is currently in progress, expected in
2017.
Associated non-digital assets: also found in three other projects
Linked publications: also found in six other projects
Other projects
Summary information only is included here for the remaining projects; more detail is available in
Appendix 1: project case studies.
Project 1: Ashmolean Cyprus Digitisation Project
Website: http://digital.humanities.ox.ac.uk/ProjectProfile/Project_page.aspx?pid=333
Department: Ashmolean Museum
Summary: Database of description/provenance for Ashmolean Cyprus collection
Formats: Zetcom MuseumPlus relational database; TIFF; JPG
Funding: A. G. Leventis Foundation
Project 2: Centre for the Study of the Cantigas de Santa Maria
Website: http://csm.mml.ox.ac.uk/
Department: Faculty of Medieval and Modern Languages
Summary: Metadata database for manuscripts of medieval Galician poems, plus some full texts
Formats: MySQL database, linked PDF and XML texts
Funding: Leverhulme Trust; Research Development Fund of Oxford University; MHRA; British
Academy
Project 3: Creative Practice in Contemporary Concert Music
Website: http://www.music.ox.ac.uk/research/cpccm/
Department: Faculty of Music
Summary: Video/transcripts of rehearsals; database of quantitative data derived from these
Formats: mp4, .docx, relational database
Funding: AHRC
Project 4: Dictionary of Medieval Latin from British Sources
Website: http://www.dmlbs.ox.ac.uk/
Department: Faculty of Classics
Summary: Complete dictionary of medieval Latin digitised as XML
Formats: XML (bespoke schema)
Funding: AHRC; Packard Humanities Institute; British Academy; John Fell Fund
Project 5: Digital Miscellanies Index
Website: http://digitalmiscellaniesindex.org/
Department: Faculty of English
Summary: Bibliographic data about 18th century poetic miscellanies, stored as XML
Formats: XML (bespoke schema)
Funding: Leverhulme Trust
Project 6: Early Modern Festival Books
Website: http://festivals.mml.ox.ac.uk/
Department: Faculty of Medieval and Modern Languages
Summary: Database of bibliographic information about early modern festival books
Formats: MySQL relational database
Funding: John Fell Fund
Project 7: First World War Poetry Digital Archive / Great War Archive
Website: http://www.oucs.ox.ac.uk/ww1lit/, http://www.oucs.ox.ac.uk/ww1lit/gwa/
Department: Faculty of English / Oxford University Computing Services
Summary: Partly community-sourced collection of multimedia objects with metadata/provenance
Formats: TIFF, JPEG, MP3, MPEG, TEI XML, txt, csv
Funding: JISC; HEFCE
Project 8: Inscriptions of Sicily
Department: Faculty of Classics
Summary: Epigraphic database of Sicilian inscriptions
Formats: currently Access database, being reworked as EpiDoc XML
Funding: John Fell Fund
Project 9: Last Statues of Antiquity
Website: http://www.ocla.ox.ac.uk/statues/
Department: Faculty of History / School of Archaeology
Summary: Database of information about statues, plus digitised photographs
Formats: FileMaker Pro database; JPG
Funding: AHRC
Project 10: Lexicon of Greek Personal Names Online
Website: http://www.lgpn.ox.ac.uk/online/
Department: Faculty of Classics
Summary: Prosopographical database of ancient Greek names, based on paper publication
Formats: Ingres database, TEI XML, RDF (CIDOC-CRM), PostgreSQL database, PDF
Funding: AHRC
Project 11: Oxford Archive of Russian Life History
Website: http://www.ehrc.ox.ac.uk/lifehistory/archive.htm
Department: Faculty of Medieval and Modern Languages
Summary: Audio recordings & transcripts of ethnographic interviews, plus photographs
Formats: Microsoft Word docs, WordPerfect, mp3, JPG
Funding: Leverhulme Trust; AHRC
Project 12: Oxford Roman Economy Project
Website: http://oxrep.classics.ox.ac.uk/
Department: Faculty of Classics
Summary: Several linked quantitative databases relating to Roman economics and trade
Formats: PostgreSQL database, CSV, Word docs
Funding: AHRC; Augustus Foundation (Baron Lorne Thyssen)
Common findings
The research in this project identified a number of common themes or concerns which rarely
have clear solutions. The recommendations below are derived directly from the evidence
gathered.
Finding 1: Data v interfaces
Although a distinction is sometimes drawn between preservation (long-term storage of data for
reuse) and sustainability (keeping the data available online, e.g. via a dynamic website), this
distinction proves too subtle for most researchers.
There is a considerable (and wholly understandable) lack of clarity among researchers around
the difference between their research data (for textual data, most commonly held in either a
relational database or increasingly as XML) and the interfaces to it (usually a website with some
kind of search interface). In some cases this confusion is not aided by the construction of
systems and applications which blur the line between data and interface: user documentation
and logic, essential to the understanding and use of the data, are embedded in the interface
software, making it harder to extract the data for preservation.
All of the academics consulted started from the point of view that the optimum method of
preservation would be to “keep the website working”; some took a stronger position that the
data were meaningless without the interface. This strong preference for sustaining applications
over preserving data in isolation stemmed from a firm belief in the current importance of these
datasets for the research community – that is, that if the data are still ‘in use’ then they should
not be ‘archived’. This, in turn, came from a strong perception of ‘archiving’ as involving taking
data out of circulation, mothballing it, making it inaccessible, marking it as less current or
relevant; this prejudice is not irrational but should be borne in mind when communicating with
researchers about preservation.
▶ Recommendations: 1. Preservation and sustainability
Finding 2: Funding gaps for sustainability and preservation
As mentioned above, most research funding is allocated to a fixed-term project and can only be
used to cover costs incurred during that period; this means that it can be difficult to fund the
ongoing maintenance of a website, database, or other digital resource. This is a particular
problem in the humanities where the scholarly benefits of resources generally accrue over the
medium- to long-term, and means there is an increased risk of technical maintenance having to
be done ‘under the radar’, on a ‘best effort’ or pro bono basis, by IT staff (or indeed researchers)
who have no time ring-fenced for the work. This in turn means that:
● staff are responsible for an ever-growing set of projects and research data
● single points of failure are more likely, e.g. where a resource relies heavily on a single
individual’s expertise and goodwill
● maintenance is likely to be more reactive than proactive, concentrating on ‘firefighting’ to
keep a resource alive in the short-term rather than working towards its long-term
sustainability
● the true cost of maintaining the resources is therefore obscured and continues to be
underfunded
However, the research found that research or IT staff rarely took active steps to close down or
discontinue a digital resource even when it was no longer funded, so this maintenance debt
simply grows. This is clearly not sustainable without increasing resources to meet demand, and
we recommend more active curation of research outputs.
It should be noted that the University RDM policy states that “research data and
records should be retained for as long as they are of continuing value to the researcher and the
wider research community”; at present it is not clear who has the authority to make that
judgement.
The combination of the explicit cost of preservation (and the difficulty of including this in a
funding bid) and the hidden cost of ongoing maintenance may mean that research materials
stay in a grey zone of best-efforts sustainability rather than being consciously and effectively
preserved; this puts them at greater risk of being lost altogether, as a) outdated and unpatched
software may be more vulnerable to deliberate attack or accidental failure, and b) expertise may
be lost as knowledgeable individuals who have been maintaining ‘unofficial’ resources out of
goodwill may move or leave without proper handover. Essentially, it is vital to be explicit about
what resources (data and interfaces) exist; what their current status is (e.g. development, live,
maintenance, archive); and who ‘owns’ them, i.e. who takes responsibility for their maintenance,
preservation or deletion.
▶ Recommendations: 2. Funding gaps for sustainability and preservation
Finding 3: Lack of reuse and evolution
The ability to reuse and build on existing data for future research is one of the most common
arguments for preservation; however, good examples of effective data reuse in the digital
humanities are comparatively hard to identify (though some are described below). We believe
there are two simple reasons for this:
● research materials are frequently not preserved in a form where they can easily be
discovered and reused
● digital humanities is still a relatively young field, particularly given the generally long
gestation periods for humanities research
However, there is also a third, more paradoxical effect: because digital humanities projects
tend to evolve over the long-term, their research materials are in a sense continually being
reused by the researchers involved in collecting/creating them. This has problematic
implications for reuse by others because:
● it can be difficult to motivate investment of time and resources in documentation while
projects are still ‘in progress’, as this is seen as an activity to be carried out when the
project is ‘finished’ (and as such may never actually happen)
● projects often grow organically over the course of decades, converting data and
adopting new technologies according to their needs (and as the technology changes
around them, the research may change to make use of new capabilities – the tools
influence the research), resulting in a wider range of technologies to document
If research data are to realise their full potential as re-usable research materials, it is vital that
they are preserved in a way that allows them to be discovered, accessed, and used by future
researchers. Discovery and access rely first and foremost on good metadata, which needs to be
created at the point of preservation. Even the most perfectly preserved and documented
research materials are of little use if nobody is aware of their existence in the first place. The
‘usability’ of an existing dataset depends on recording the original context of the data
capture/creation alongside the data itself. In some cases there may be several different points of
capture and/or conversion and it may be necessary to preserve or document the project’s
history rather than simply a snapshot of the current/finished state of the data. Examples of
existing projects which demonstrate reuse are:
Digital Miscellanies Index: in the course of the original project, data has been optimised for
reuse and preservation; this has enabled phase 2 of the project, which will combine the original
data with that from the Verse Miscellanies Online project
(http://versemiscellaniesonline.bodleian.ox.ac.uk/) and from another ‘orphaned’ database,
creating mappings between the three datasets to allow effective cross-searching and
visualisation of verse miscellanies across a much broader historical period.
Lexicon of Greek Personal Names and SNAP:DRGN: the LGPN has digitised, converted and
normalised data to make it more widely available and reusable, and SNAP:DRGN defines
standards for making prosopographies of the ancient world (including the LGPN) available as
open linked data. While neither of these in itself constitutes reuse of data for novel
research, both make strong contributions to the work of enabling reuse.
It should be noted that all these examples of effective reuse involve strong collaborations with
other institutions; any proposed institutional data management solutions must meet the
requirement for data sharing and collaborative working outside Oxford as well as within it.
▶ Recommendations: 3. Reuse and evolution
Finding 4: Advice, training, and mentoring
Almost every respondent interviewed stressed the value of conversations in person with
research technologists (in IT Services, in the Bodleian, and in faculties/departments) for all
aspects of creating/collecting, managing, developing, and disseminating their data: selecting
tools and designing methods for data collection; data analysis and visualisation; data
conversion; developing user interfaces; and so on. These dialogues offered a chance to explore
possibilities and work through the advantages and disadvantages of possible approaches rather
than presenting a single technical ‘solution’.
At the outset of a project researchers may not have a clear idea of what their eventual data-
management requirements will be; they may not know what technological options are available
to them and what the longer-term consequences of those options may be; and they may not
know what questions they need to ask to fill in these gaps. More open-ended conversation with
data management and research technology experts helps to build up a fuller picture of the
research data and related resources, how the data has been managed to date, and how best to
go forward with long-term sustainability and preservation in mind.
Ideally this sort of conversation should happen at the beginning of a project – expert support
early on can highlight potential problems, suggest possible developments and enhancements,
and identify areas for possible collaboration with other projects. In this last area it is important to
note the advantage of not having a separate humanities-specific research support area: that is,
central support services can identify areas where knowledge and technological development
can be shared between projects in different disciplines. This can result in potentially fruitful
interdisciplinary collaboration, as well as simple efficiency.
While researchers had much praise for the support given to them by individual members of staff
and by departments as a whole, they also expressed frustration at:
● Receiving conflicting advice from different sources
● Different technologies being supported by different departments/services – some insist
on migrating/rewriting when inheriting an application from elsewhere
● Lack of clarity about how the different central services interact
Even within any one central department (IT Services or the Bodleian Libraries) there are several
different sections which might be asked to give advice on research technologies and data
management (see Relevant service providers for more details of these services), and these
sections operate largely independently of each other. While personal conversations are
important, then, they need to be with people who are aware of the available options, and as part
of a joined-up approach.
Some steps have already been taken to address these issues, notably:
● a new RDM website (http://researchdata.ox.ac.uk/) and email address
([email protected]) offering a single point of contact for RDM enquiries (see
Relevant service providers)
● a training programme offered by Research Support at IT Services
(http://blogs.it.ox.ac.uk/acit-rs-team/events/rdmcourses/) promoting consistent data
management principles while tailoring its message to specific disciplines
While these are welcome beginnings and necessary underpinnings for increased harmonisation
of service provision, it is clear from conversations with academics that more passive, static
information (websites, formal training sessions) will not meet all their requirements and that
personal dialogues with experts are still vital. It should also be noted that no ‘single point of
contact’ can prevent people accessing services via other points; what is needed is more of a
clearing-house for projects and information, however and wherever the first contact is made.
Several researchers independently suggested that ‘digital humanities mentoring’ would have
been useful to them, and some of these indicated that they might be willing to act as mentors in
the future. A pilot ‘RDM mentoring’ project is currently under way in the Social Sciences (with
mentoring initially being offered by the Research Support team at IT Services), and has already
garnered considerable interest from academics.
▶ Recommendations: 4. Advice, training, and mentoring
Finding 5: Repository policy on data and formats
In general researchers interviewed for this project were aware a) of the central publications
repository (though few had used it), and b) that a central data repository (ORA-Data) either
already existed or was imminent. There was, however, much less clarity about the purpose of
ORA-Data, i.e. whether it was intended for:
● discoverability (search or serendipity)
● actively promoting the diversity of Oxford research (marketing our research)
● archiving/preservation (ensuring nothing is lost)
● backups (ensuring data and interfaces can be efficiently restored in event of a crash)
● accountability (e.g. fulfilling the REF)
● compliance (e.g. meeting funder requirements)
● sharing data as a principle (Open Access)
or some combination of the above. There were also questions and concerns expressed by
researchers about what data is desirable and permissible within ORA-Data, e.g. should/could it
include:
● all data produced by Oxford researchers?
● data associated with publications only?
● data which has no other appropriate (e.g. subject-specific) repository? (NB a significant
majority of humanities data is believed to fall into this category, far more than in other
disciplines)
● ‘unfinished’ or unfunded data? (e.g. data which is unpublished, incomplete, not part of a
project)
A deposit policy is currently being drafted11 and these questions are being addressed.
As well as the general question of what categories of research materials are permitted or
invited, there was a more specific technical question of what file formats the repository would
accept. Reasons for selecting formats and technologies are often contextual rather than
functional; there are usually many different options which could meet researchers’ immediate
needs, but they are arguably most likely to use:
● what they have always used
● what is most readily available
● what colleagues in their field use, or
● what their local ITSS prefers/mandates for ongoing support
In many cases these choices may have been based on business considerations and/or
historical accident as much as technical or methodological reasons.
Under Oxford’s devolved structure no-one has the power to prescribe (or proscribe) what
formats are accepted for preservation; however, clear guidelines on which formats we can most
easily support and preserve should be widely disseminated, along with advice on which formats
are most appropriate for enabling future reuse. In practice we can always ‘preserve the
bitstream’12 even if the format is suboptimal or the data is poorly understood/documented, but
making such data reusable may not be cost-effective.
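As an illustration of what 'preserving the bitstream' involves in practice, the following minimal Python sketch records a SHA-256 checksum for every file in a dataset and later reports any file whose bits have changed or gone missing. It is a sketch under stated assumptions, not a description of how ORA-Data works: the function names (sha256_of, build_manifest, changed_files) and the manifest layout are hypothetical, introduced here only for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file path (relative to root) to its checksum."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(root: Path, manifest: dict[str, str]) -> list[str]:
    """List files from the manifest that are missing or no longer match."""
    current = build_manifest(root)
    return sorted(
        name for name, digest in manifest.items() if current.get(name) != digest
    )


if __name__ == "__main__":
    # Demonstration on a throwaway directory with an invented file.
    with tempfile.TemporaryDirectory() as tmp:
        dataset = Path(tmp)
        (dataset / "records.csv").write_text("id,title\n1,example\n")
        manifest = build_manifest(dataset)
        print("changed after deposit:", changed_files(dataset, manifest))
        (dataset / "records.csv").write_text("id,title\n1,altered\n")
        print("changed after edit:", changed_files(dataset, manifest))
```

A check of this kind confirms only that the bits are intact; it says nothing about whether the format can still be opened or the data understood, which is why bitstream preservation is necessary but not sufficient.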
File formats are of course not institution-specific, and various organisations already provide
comprehensive guidelines on the effectiveness of different formats for long-term preservation,
e.g.
● the Library of Congress recently released its recommended format specifications:
http://www.loc.gov/preservation/resources/rfs/
11 See Appendix 3: Draft ORA-Data Policy Statement
12 Bitstream preservation refers to the process of storing and maintaining digital objects over time,
ensuring that there is no loss or corruption of the bits (units of information) making up those objects; this
is effectively the lowest-level preservation possible, necessary but not sufficient to preserve meaningful
ongoing access. A more detailed definition is available here:
http://www.paradigm.ac.uk/workbook/preservation-strategies/degree-bitstream.html
● the UK Data Archive has recommendations on file formats and software:
http://www.data-archive.ac.uk/create-manage/format/formats
● the Digital Curation Centre (DCC) offers guidance on selecting formats which, while
slightly older, articulates useful principles for consideration when choosing file formats to
recommend: http://www.dcc.ac.uk/resources/curation-reference-manual/completed-chapters/file-formats
▶ Recommendations: 5. Repository policy on data and formats
Finding 6: ORA-Data API
Several ITSS expected or wished to be able to use ORA-Data as a ‘back end’ for building a
website or application (e.g. storing the data in ORA-Data but providing an alternative web
interface for searching, browsing, visualising that data). The ability to do this would be a clear
incentive for depositing data in ORA-Data; however, at present it would be difficult for most
datasets because of the structure of objects in the repository and the limitations of the API. The
main barriers to using ORA-Data for this type of development are:
1. Locations of objects in ORA-Data are not persistent. URLs in ORA-Data include the
silo name, which will change if a package is moved from one silo to another. It should be
acknowledged that in practice this is unlikely to happen frequently; however, the lack of
persistent addresses would be considered a risk if using the repository as the back end
for a web-based application.
2. Lack of granularity in addressing data. Once a database or table is ingested into
ORA-Data there is no way to address individual entries (e.g. by linking directly to them,
or targeting them in a search); similarly, once XML documents are ingested there is no
way to address or query them at anything lower than file level. Any application which
wanted to make effective use of data from ORA-Data in these formats would have to
store a local copy of the entire dataset and perform its own indexing, thus losing some of
the advantages of storing the data in the repository.
3. Inability to restrict a search by silo. The Databank API does not provide a way to limit
a search (e.g. for a data package or file name) by silo; at present there are relatively few
silos but as ORA-Data grows this will introduce significant inefficiencies into common
searches.
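The second barrier can be made concrete with a short Python sketch of the workaround it describes: an application downloads a whole XML file and builds its own local index so that individual entries become addressable. This is illustrative only. The sample XML, the element name `entry`, and the `id` attribute are invented and do not reflect the bespoke schema of any project named in this report; no ORA-Data or Databank API calls are shown, since the file is assumed to have been retrieved already.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Invented sample standing in for a file retrieved from the repository.
SAMPLE = """<dictionary>
  <entry id="e1"><headword>abbas</headword></entry>
  <entry id="e2"><headword>abbatia</headword></entry>
</dictionary>"""


def index_entries(xml_text: str, tag: str = "entry", key: str = "id") -> dict[str, ET.Element]:
    """Parse the whole file and index its elements by identifier."""
    root = ET.fromstring(xml_text)
    return {el.attrib[key]: el for el in root.iter(tag) if key in el.attrib}


def fetch_entry(index: dict[str, ET.Element], entry_id: str) -> Optional[str]:
    """Return one entry serialised as XML, or None if the id is unknown."""
    element = index.get(entry_id)
    if element is None:
        return None
    return ET.tostring(element, encoding="unicode").strip()


if __name__ == "__main__":
    local_index = index_entries(SAMPLE)
    print(fetch_entry(local_index, "e2"))
```

The whole file has to be parsed and held locally before a single entry can be served, which is the duplication of effort the barrier describes.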
The API in its current state would allow ORA-Data to be used as a back end for simple
applications where the data are more ‘collection-like’, e.g. a set of images with metadata; the
First World War Poetry Digital Archive and Great War Archive produced a report which
addressed the possibility of using ORA-Data (then Databank) in this way13, and concluded that it
would be possible. However, this report also observed that the available search functionality
would be fairly restricted; that Databank was not yet a reliable enough service; and that
adequate support could not be guaranteed.
There is clearly a ‘chicken and egg’ problem here, that is: the limitations of the API restrict
development of interfaces using ORA-Data as a back end; the API could be developed further,
but without use cases, it is not clear what direction that development should take (and resources
for any further development are currently extremely limited).
▶ Recommendations: 6. ORA-Data API
Results
Datasets in ORA-Data
One of the key aims of the project was to acquire and ingest datasets from all the collaborating
projects. This aim has only partially been achieved, with datasets being acquired from the
following projects:
● Ashmolean Cyprus Digitisation Project
● Centre for the Study of the Cantigas de Santa Maria
● Dictionary of Medieval Latin from British Sources
● Early Modern Festival Books
● First World War Poetry Digital Archive / Great War Archive
● Lexicon of Greek Personal Names
● Oxford Roman Economy Project
● Sphakia Survey
Negotiations are still in progress to acquire data from Digital Miscellanies Index and Last
Statues of Antiquity. Only minimal metadata has been gathered for most projects.
Improved digital preservation guidelines
Some improvements have been made to the digital preservation guidelines on the DH@Ox
website14. However, our conversations with researchers strongly suggested that ‘passive’
guidelines like this were rarely consulted, and that the most useful information that a website
could give would be an indication of where to go for direct personalised advice and consultancy.
13 See Appendix 5: Databank and other solutions for archiving, searching and displaying the First World
War Poetry and Great War Archive collections
14 http://digital.humanities.ox.ac.uk/Support/Guidelines.aspx
Two specific areas where it was acknowledged that passive guidelines could serve a useful
reference purpose were
● accepted or recommended file formats for preservation
● information about licensing
Neither of these is institution-specific, and we would recommend resisting the urge to reinvent
the wheel in either case.
Conclusions
Recommendations
1. Preservation and sustainability
1.1 Ensure that data are discoverable, searchable, viewable, and downloadable while
‘archived’; this requires:
● good metadata
● some form of preview functionality for data in the repository
● ideally, the ability to search within datasets, not just within top-level metadata
1.2 Actively promote Oxford’s digital data as research materials, framing them in a
meaningful research context. We should regard them as assets in a collection, to be curated,
displayed, and exhibited where possible rather than merely catalogued and stored.
1.3 Choose terminology carefully in our communications to emphasise that ‘preserving’ is
ideally about keeping the data alive for ongoing use, not burying it. This message will be more
convincing if concrete examples can be cited which demonstrate active reuse of well-preserved
data.
1.4 While separating ‘data’ from ‘interface’ or ‘application’ may not be possible for existing
research data, future projects should bear this separation in mind and set different levels of
expectation for the longevity of each. Data will last longer than interfaces.
2. Funding gaps for sustainability and preservation
2.1 Maintain full oversight of research data, websites and applications for which
faculties/departments are currently responsible (and what that responsibility entails, e.g.
funding, hosting, maintenance, preservation, right to delete) and what their current status is (e.g.
development, live, maintenance, archive) by means of regular faculty-level audits of digital
research assets
2.2 Develop a clear funding model for the ongoing maintenance of relevant resources which
are judged to be current and useful
2.3 Provide a robust and reliable system of long-term preservation for resources which are
to be archived
2.4 Investigate the feasibility of offering a ‘free at point of service’ preservation facility (by
underwriting at institutional level, top-slicing divisions, or a combination of both); as long as
unofficial maintenance appears cheaper than official preservation, people will choose the former
despite the increased risks of data loss
2.5 Acknowledge that it may not be possible (or practical) to preserve everything. Every new
project should have an ‘end of life’ plan, prioritising what is to be preserved and what is not
3. Reuse and evolution
3.1 Improve the discoverability and searchability of datasets held by the institution by
encouraging the creation of good metadata and supporting the ingest of metadata with
metadata assistants
3.2 Identify, invest in and promote projects which demonstrate effective reuse
3.3 Encourage and promote collaboration with other institutions
3.4 Move towards wider provision of research outputs as linked open data,
enabling more immediate and effective reuse
3.5 Where appropriate and practical, preserve project history and contexts for data capture
as part of the metadata
4. Advice, training, and mentoring
4.1 Invest in high-value consultancy and mentoring
4.2 Extend the pilot Social Sciences Research Data Management (RDM) mentoring scheme
to the Humanities, and where possible facilitate peer-to-peer mentoring rather than relying on
limited central resources
4.3 Build on exemplary training such as the Digital Humanities at Oxford Summer School
(http://digital.humanities.ox.ac.uk/dhoxss/), which combines the teaching of good data
management principles and the development of a strong community of practice with more
hands-on practical training
4.4 Promote technological consensus where appropriate (while recognising that there may
be cases where individual projects need to diverge from this, and considering the knock-on
effect of those decisions) – Oxford’s research support provision can never be ‘one size fits all’
but rather should target a suite of supported and well-understood technologies, chosen to fit
common patterns of research, on which in-house expertise can be focused
4.5 Emphasise that preservation and sustainability need to be understood from the very
beginning of a project and that the technologies selected will have an impact on both; identify
key areas in existing programmes of humanities research training and advice where awareness
could be increased
4.6 Draw on enquiries to IT Services, BDLSS, and the new RDM single point of contact to
develop a ‘knowledge base’ which all digital scholarly support staff can use
4.7 Set up a Digital Scholarly Support office (ideally with a staffed physical location as well
as online resources) to act as a unifying framework for all these initiatives, a knowledge
exchange centre for providers of research data management and preservation support, and a
more visible and approachable ‘front of house’ inviting enquiries from researchers, whether
simple questions or more exploratory, open-ended dialogue about digital research requirements
4.8 Invest in projects that join up and build upon existing technical and social services.
Ensure that the collaboration is fully resourced, funded, and encouraged.
4.9 Ensure all projects have a Data Management Plan (DMP)
5. Repository policy on data and formats
5.1 Establish internal clarity and consensus on repository policy, and communicate this
proactively to potential users
5.2 Handle communications about what data is permitted/invited tactfully, particularly in
terms of a) data which has no other appropriate repository and b) data which is not eligible or
appropriate for ORA-Data.
5.3 Avoid reinventing the wheel when producing guidelines on file formats – build on existing
guidelines from other institutions, always bearing in mind that the methods and formats chosen
for any research project should above all fit the goals of the project
5.4 Investigate and user-test the possibility of categorising data by format and/or availability
of metadata, as a way of signifying how reusable it is in different contexts (by analogy with e.g.
the emerging systems for signalling Open Access status15)
6. ORA-Data API
6.1 Enable persistent URLs for data packages
6.2 Enable the addressing of data at arbitrary levels of granularity, whether directly within
ORA-Data or by introducing an intermediate application layer which can ‘unpack’ data and
handle these requests16
6.3 Improve search options available (restrict search by silo, search within datasets), either
via the API or again via an intermediate layer
6.4 Develop prototype applications and interfaces making use of data in ORA-Data
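One possible reading of Recommendation 6.1, offered only as a sketch, is an intermediate layer that issues stable identifiers and resolves them to wherever a package currently lives, so that moving a package between silos does not break links held by applications. Everything named below is hypothetical: the `PersistentResolver` class, the identifier format, the base URL, and the `{silo}/{package}` path pattern are assumptions for illustration and are not taken from the ORA-Data or Databank API.

```python
from dataclasses import dataclass


@dataclass
class Location:
    """Where a data package currently sits in the repository."""
    silo: str
    package: str


class PersistentResolver:
    """Maps stable identifiers to the current location of a package."""

    def __init__(self, base_url: str) -> None:
        self.base_url = base_url.rstrip("/")
        self._locations: dict[str, Location] = {}

    def register(self, pid: str, silo: str, package: str) -> None:
        """Record the location of a newly deposited package."""
        self._locations[pid] = Location(silo, package)

    def move(self, pid: str, new_silo: str) -> None:
        """Update the record when a package is moved to another silo."""
        self._locations[pid].silo = new_silo

    def resolve(self, pid: str) -> str:
        """Return the current URL for a stable identifier."""
        location = self._locations[pid]
        return f"{self.base_url}/{location.silo}/{location.package}"


if __name__ == "__main__":
    resolver = PersistentResolver("https://repository.example/")
    resolver.register("pid-0001", "silo-a", "festival-books")
    print(resolver.resolve("pid-0001"))
    resolver.move("pid-0001", "silo-b")
    print(resolver.resolve("pid-0001"))
```

An application that stores only the stable identifier is unaffected by the move; the silo name changes in the resolved URL, not in anything the application has saved.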
Next steps
1. Information
We propose that a digital humanities data audit be carried out, investigating the extent of
research materials which have been produced and their current preservation status. This will
involve three stages:
● Perspective: survey researchers, research facilitators, ITSS to assess the extent of data
currently ‘at risk’.
● Priorities: rank data according to ‘endangeredness’, effort required to preserve it, and
value to research community.
● Projects: identify projects which can share methods, define work packages to preserve
‘at risk’ data, identify appropriate sources of funding where possible.
2. Innovation
We recommend more proactive curation and development of research materials to identify
datasets, research questions, applications, and methods which can act as prototypes for active
promotion of the benefits of data reuse to individuals, to the institution, and to the wider
research community. Money should be invested in projects that can directly illustrate the
benefits of preservation of research data, particularly interdisciplinary and cross-institutional
15 http://en.wikipedia.org/wiki/Wikipedia:WikiProject_Open_Access/Signalling_OA-ness
16 For an outline of one possible way this could work, see Appendix 7: “How to curate an XML resource in ‘live’ and ‘dormant’ modes”
collaborations, and in working towards the goal of making more research outputs available as
open linked data.
A set of exemplar projects that highlight reuse and collaboration would make explicit the benefit
of preservation and open data, thereby motivating more investment and involvement in data
preservation and reuse.
3. Investment
We recommend continued investment in user education and in digital preservation
infrastructure, both technical and social, in order to capitalise on the creation of the Digital
Humanities Champion and the DH Network co-ordinator.
User education. A bottom-up approach is essential. We believe that personal ‘digital scholarly
support’ is key to establishing awareness and ‘buy-in’ within the division, not only at the level of
project development, but also in conveying the potential and challenges of DH approaches to
the entire academic community. It is expected that the Digital Humanities Champion will lead
here, but this role alone is not sufficient. By drawing attention to the possibilities of DH, the DH
Champion will emphasise the need for more joined-up advice and support, and identify where
those connections need to be made. We therefore recommend
● identifying existing successful initiatives and building on these (e.g. developing from the
DH@Ox Summer School to an in-term programme of training in DH methods for Oxford
academics; extending the RDM mentoring scheme, currently being piloted in Social
Sciences, to the Humanities), and
● identifying key areas where awareness of digital methods & RDM issues could be
increased in general humanities research training.
Digital preservation infrastructure. It is vital that we engage closely with the process of
investment in digital preservation infrastructure (technology, personnel, and policy framework) to
ensure both that the needs of humanities researchers are being met and that solutions
developed in DH can inform wider development. The proposed data audit (see 1, above) will
give us the information needed to predict more accurately our future infrastructure requirements;
proactive curation and development of DH data (see 2 above) will highlight the positive
contribution to digital preservation which DH research has to offer.