http://www.ukoln.ac.uk/
Current and Emerging Scientific Data Curation Practices
Michael Day, Digital Curation Centre, UKOLN, University of Bath [email protected]
4th Summer School on preservation in digital libraries, Tirrenia, Italy, 12 June 2008
4th DELOS Summer School on Digital Preservation, 12 June 2008
Presentation outline:
– Some definitions
– Reasons for curating research data
– Some specific issues:
• Scale and complexity, diversity of social contexts, costs
– Types of research data collection
– Roles and responsibilities
– Potential for collaboration
– Some open questions
Definitions – research data (1)
• What is research data?
– An extremely broad category of material
• “... any information that can be stored in digital form, including text, numbers, images, video or movies, audio, software, algorithms, equations, animations, models, simulations, etc.” (National Science Board, Long-lived digital data collections, 2005)
• In practice, it can mean almost anything
Definitions – digital curation (1)
• DCC definition:
– “... maintaining and adding value to a trusted body of digital information for current and future use; specifically, we mean the active management and appraisal of data over the life-cycle of scholarly and scientific materials” (http://www.dcc.ac.uk/)
Definitions – digital curation (2)
• Main themes:
– Curation is seen as an ongoing process, e.g. the active management of data over time
– It is also about adding value through things like community annotation
– Life-cycles are important; long-term stewardship is not always necessary
– Not identical to digital preservation
Why curate research data? (1)
• Part of the normal research process:
– The need for others to validate and replicate research
– In some disciplines, supporting data is routinely made available to reviewers and linked from journal papers
– Principles of sharing and openness are firmly embedded in some disciplines
Why curate research data? (2)
• Extrinsic and intrinsic value:
– High investment in research
– Data can be very expensive to capture and analyse
– Data can be impossible to recreate once lost
– Observational data (by definition) is irreplaceable
– Current generations of instruments can gather more data than can be analysed
Why curate research data? (3)
• The potential for creating 'new' knowledge from existing data:
– Re-use, re-analysis, data mining
– Annotation, e.g. in molecular biology and astronomy
– Combining datasets in innovative ways, e.g. mapping biodiversity data onto ecological GIS
– “Science 2.0”
Why curate research data? (4)
• It is increasingly a requirement of some research funding bodies:
– Some have quite mature data retention policies (not necessarily for permanent retention)
– Increasing expectation of access to data from publicly-funded research
– OECD Principles and guidelines for access to research data from public funding (2007)
Why curate research data? (5)
• Institutional asset management:
– Universities and other research organisations invest very large sums of money in research activities
– Research data is a key output of this activity
– It is, therefore, an institutional asset that needs stewardship
Why curate research data? (6)
• Promoting the institution, research group or individual:
– Re-use helps promote visibility and 'impact'
– Institutions become acknowledged 'centres of competence'
Scale and complexity (1)
• Scale (1):
– The “digital deluge”
• e-Science
• New generations of instruments
• Computer simulation
• Many terabytes generated per day, petabyte-scale computing (and growing)
Scale and complexity (2)
• Scale (2):
– Problems of scale are particularly acute in traditional 'big-science' disciplines:
• Particle physics (e.g., Large Hadron Collider)
• Astronomy (sky surveys, etc.)
– Also increasingly important in:
• Bioinformatics, crystallography, engineering design, and many others
– May be cheaper just to generate the data again, e.g. for gene sequencing
Scale and complexity (3)
• Complexity (1):
– Research data is extremely diverse - not really a single category of material
• tabular data, images, GIS, etc.
• raw machine output vs. derived data
• varying levels of structure (XML, legacy formats, etc.)
• many different standards
– Research data is not homogeneous
– No one-size-fits-all approach possible
Scale and complexity (4)
• Complexity (2):
– Even wider range of social contexts in which data is used (and shared)
– DCC SCARP project has been exploring disciplinary factors in curation practice
• Practice even within single disciplines is very fragmented
• Case studies ongoing
– Big-science archives, medical and social sciences, architecture and engineering, biological images
Diversity of contexts
• Research cultures
– Data practices vary widely, even within a single discipline
• Gene sequence data is typically deposited in public databases
• In proteomics, sharing is less widespread, partly because of a lack of standards, but also because of questions about who has exploitation rights
– Role of commercial interests
• Pharmaceuticals, architecture and engineering, geological prospecting
Costs
• Recent JISC study (2008):
– Focused on the institution level
– Some findings:
• The complex service requirements for curating research data mean that institutions are setting up federated approaches to repository development
• Currently ingest costs are much higher than long-term storage and preservation costs
• Start-up (and R&D) costs are high, but there can be economies of scale
Research data collections (1)
• A typology (1):
– From National Science Board report Long-lived digital data collections (2005)
• Research data collections – the products of one or more focused research projects
• Resource or community data collections – collections that emerge to serve particular subject sub-disciplines
• Reference data collections – serve a broader and more diverse set of user communities
Research data collections (2)
• A typology (2):
– Research data collections – the products of one or more focused research projects
• Extremely diverse
• Have small user communities
• Inconsistent standardisation
• Typically no funding available to support the collection beyond the project funding cycle
Research data collections (3)
• A typology (3):
– Resource or community data collections – collections that emerge to serve particular subject sub-disciplines
• Often establish community-level standards
• In many cases supported by funding bodies or particular research institutions
Research data collections (4)
• A typology (4):
– Reference data collections – collections that serve a broader and more diverse set of user communities
• Conformance to robust, well-established standards is essential
• Expensive (time and money)
• Budget typically comes from multiple sources, expectation that collections will persist
Research data collections (5)
• Data at risk
– Data in “research data collections” is most at risk
• A modern version of the “file-drawer problem”
• Data stored on personal hard-drives or on media; largely undocumented (cf. DAF)
• Particular challenge when the data creator has retired or moved to another institution
• Data creators not always aware of its value
• The reward structure of science is not helpful
Research data collections (6)
• Collections can evolve:
– For example, Protein Data Bank (PDB)
• Launched 1971, small-scale, focused on a limited set of biological structures
• Is now the main source of experimental structural information on biological macromolecules
– How do we recognise research data collections that have the potential to evolve into reference collections?
Roles and responsibilities (1)
• Long-lived data collections (NSB)
– Data authors
– Data managers
– Data scientists
– Data users
– Funding agencies
• Dealing with data (JISC)
– Scientist
– Institution
– Data centre
– User
– Funder
– Publisher
Roles and responsibilities (2)
• Scientists
– Initial creation and use of data
– Expectation of first use and of gaining appropriate credit and recognition
– Responsible for:
• Managing data for life of project
• Using standards (where possible)
• Complying with data policies
• Making the data available in a form that can (easily?) be used by others
Roles and responsibilities (3)
• Institutions:
– Role less clear
– Institutional policies may require short-term management of data
• Advocacy and training
– Some institutions are developing repository services
• These are currently rarely used for research data
• Federated approaches maintain disciplinary involvement
Roles and responsibilities (3)
• Data centres
– Undertake curation and provide access
– Responsible for:
• Selection and ingest
• Participating in the development of standards
• Protecting the rights of data creators
• Supporting ingest and metadata capture
• Supporting re-use (tools and services)
• Training
Roles and responsibilities (4)
• Users:
– Users of third-party data
– Responsible for:
• Adhering to any licenses and restrictions on use
• Acknowledging data creators and curators
• Managing any derived data
• Providing feedback to scientists and data centres
Roles and responsibilities (5)
• Funding bodies:
– Acting at policy level
– Responsible for:
• Considering wider policy perspectives
• Developing policies in co-operation with other stakeholders
• Monitoring and enforcing data policies
• Support for long-term data management
• Support for data curation
Curation infrastructures (1)
• Focus on the generic:
– Need for a balance between:
• The 'bottom-up' discipline-based drivers that promote the generation of research data
• The policy level, looking to make cost-effective investment in curation
– When building infrastructures, focus on the generic
• Storage systems and middleware
• Identifying the needs of the wider community
Curation infrastructures (2)
• The need for collaboration:
– Need for 'deep-infrastructure' recognised as far back as 1996 by the Task Force on Archiving of Digital Information
– Digital preservation involves the "grander problem of organizing ourselves over time and as a society ... [to manoeuvre] effectively in a digital landscape" (p. 7)
Collaboration on curation (1)
• Collaboration in science:
– Collaboration is deeply embedded in some (but not all) research cultures
– Research collaboration is a well-established phenomenon that has been studied by sociologists of science (and others)
– The nature of collaboration differs markedly between academic disciplines
Collaboration on curation (2)
• Scientific collaboration types:
– Informal social networks
• Help to define disciplinary norms and interpretational paradigms
– Formalised, semi-permanent organisations
• Traditionally most common in "big-science" domains
• The growth of e-science has emphasised the collaborative nature of research
Collaboration on curation (4)
• Implications for curation:
– Collaborative data curation facilities might emerge first in sub-disciplines that have a more participatory collaboration pattern or otherwise have a strong emphasis on data sharing
– Need for more systematic research into this across all research domains
• Building on DCC SCARP
Collaboration on curation (5)
• Collaboration:
– Is currently focused at disciplinary or sub-disciplinary levels
– Is embedded within the workflows of particular research communities (e.g., genomics, crystallography, astronomy)
– Takes advantage of the specialised knowledge available within particular "designated communities"
Collaboration on curation (6)
• Collaboration and standards:
– Common standards emerge where there is a recognised need for data sharing
– The existence of common standards makes data centres and repositories viable
Collaboration on curation (7)
• Interdisciplinary collaboration:
– Previously little demand for collaboration on data curation across disciplinary borders
– But the fundamentally collaborative nature of e-research should make us challenge this:
• A need to pool resources and expertise
• A need for supporting infrastructures
Collaboration on curation (8)
• Need for strategic alliances
– National initiatives, e.g. DPC, NDIIPP, nestor
– European Alliance for Permanent Access
Open questions (1)
• The role of institutions
– Universities are setting up repositories
– Rhetoric suggests that they aim to manage all research outputs (i.e. including data)
– In practice, they currently mostly deal with research papers
– What is the role of the institution with regard to research data?
– Do they have the trust of researchers?
Open questions (2)
• Are generic approaches possible?
– There is a tension between the diversity and complexity of research data and the need for generic solutions
• Promoting data sharing between disciplines
• Interoperability
Open questions (3)
• Data can only exist as part of wider research contexts
– They are referenced in papers and other forms of research communication, in project documentation and archives
– Linked from project Web pages, etc.
– How do we ensure that curated data remains integrated within this scholarly web?
– How do we make the links persistent?
Summing-up
• Size and diversity
– Research data is extremely diverse
– No one-size-fits-all solution
– Scale is a growing problem
• Infrastructures:
– Many data curation services already exist – good practice
– Need to integrate these (and institutional initiatives) at the policy level
Further reading
• Neil Beagrie, Julia Chruszcz, and Brian Lavoie, Keeping research data safe: a cost model and guidance for UK universities (JISC, 2008)
• Liz Lyon, Dealing with data: roles, rights, responsibilities and relationships (JISC, 2007)
• National Science Board, Long-lived digital data collections: enabling research and education in the 21st century (NSF, 2005)
Exercise
Exercise (1)
• 4 Scenarios:
– A research team in 2028 is evaluating a particular set of content for use in a particular project (Web content, multimedia, images, dataset)
– Ask questions about what they would need to know to interpret the content correctly
– Evaluate the relative importance of: content, context, appearance, structure, behaviour
Exercise (2)
• Scenario 1: A research project in 2028 is trying to explore how the first-generation Internet was used by European political parties in the 1990s to promote citizen participation in policy formation. The investigators know that a large amount of Web material from this period is held by an organisation called the Internet Archive, and they have begun to use data mining tools to explore the extent of its holdings. What will they need to know about the collection in order to be able to do their work properly?
Exercise (3)
• Scenario 2: Art curators in 2028 are trying to put together an exhibition of digital art in a public gallery. They have found that a university art department retains a collection of digital art resources (chiefly multimedia) produced by its undergraduate students between 2000 and 2005, some of whom have gone on to become extremely important figures in the art establishment. When evaluating the collection for use in the exhibition, what would they consider to be the most important object characteristics?
Exercise (4)
• Scenario 3: Healthcare researchers in 2028 are trying to trace the historical incidence of certain lung abnormalities and have access to a massive database of medical images (X-rays) that they intend to submit to the most up-to-date content-based image retrieval techniques. The database is made up of imaging output from more than one hospital and the researchers are worried that certain parameters essential to their research (e.g., the age and sex of patient, imaging dates, etc.) may be missing. What else do they need to know about the database before they can start running their search algorithms?
Exercise (5)
• Scenario 4: A research project in 2028 is trying to find links between climate records and biological species diversity in south-west England. The principal investigator has found a promising dataset of geographically-relevant biodiversity information in a local history museum. What more does she need to know about this dataset before she can get her team to try to integrate this dataset (and others like it) with historical climate models?
Acknowledgements
The Digital Curation Centre is funded by the JISC and the UK Research Councils' e-Science Core Programme.
http://www.dcc.ac.uk/
UKOLN is funded by the Museums, Libraries and Archives Council, the Joint Information Systems Committee (JISC) of the UK higher and further education funding councils, as well as by project funding from the JISC, the European Union, and other sources. UKOLN also receives support from the University of Bath, where it is based.
http://www.ukoln.ac.uk/