
Current and emerging scientific data curation practices

Date post: 18-May-2015
Upload: michael-day
Description:
Slides from a presentation given at the 4th DELOS Summer School on Digital Preservation in Digital Libraries, Tirrenia, Pisa, Italy, 8-13 June 2008
Transcript
Page 1: Current and emerging scientific data curation practices

http://www.ukoln.ac.uk/

Current and Emerging Scientific Data Curation Practices

Michael Day, Digital Curation Centre
UKOLN, University of Bath

4th Summer School on preservation in digital libraries, Tirrenia, Italy, 12 June 2008

Page 2: Current and emerging scientific data curation practices

4th DELOS Summer School on Digital Preservation, 12 June 2008

Presentation outline:

– Some definitions
– Reasons for curating research data
– Some specific issues:
  • Scale and complexity, diversity of social contexts, costs
– Types of research data collection
– Roles and responsibilities
– Potential for collaboration
– Some open questions

Page 3: Current and emerging scientific data curation practices

Definitions – research data (1)

• What is research data?
  – An extremely broad category of material
    • “... any information that can be stored in digital form, including text, numbers, images, video or movies, audio, software, algorithms, equations, animations, models, simulations, etc.” (National Science Board, Long-lived digital data collections, 2005)
    • In practice, it can mean almost anything

Page 4: Current and emerging scientific data curation practices

Definitions – digital curation (1)

• DCC definition:
  – “... maintaining and adding value to a trusted body of digital information for current and future use; specifically, we mean the active management and appraisal of data over the life-cycle of scholarly and scientific materials” (http://www.dcc.ac.uk/)

Page 5: Current and emerging scientific data curation practices

Definitions – digital curation (2)

• Main themes:
  – Curation is seen as an ongoing process, e.g. the active management of data over time
  – It is also about adding value through things like community annotation
  – Life-cycles are important; long-term stewardship is not always necessary
  – Not identical to digital preservation

Page 6: Current and emerging scientific data curation practices

Why curate research data? (1)

• Part of the normal research process:
  – The need for others to validate and replicate research
  – In some disciplines, supporting data is routinely made available to reviewers and linked from journal papers
  – Principles of sharing and openness are firmly embedded in some disciplines

Page 7: Current and emerging scientific data curation practices

Why curate research data? (2)

• Extrinsic and intrinsic value:
  – High investment in research
  – Data can be very expensive to capture and analyse
  – Data may be impossible to recreate once lost
  – Observational data (by definition) is irreplaceable
  – Current generations of instruments can gather more data than can be analysed

Page 8: Current and emerging scientific data curation practices

Why curate research data? (3)

• The potential for creating 'new' knowledge from existing data:
  – Re-use, re-analysis, data mining
  – Annotation, e.g. in molecular biology, astronomy
  – Combining datasets in innovative ways, e.g. mapping biodiversity data onto ecological GIS
  – “Science 2.0”

Page 9: Current and emerging scientific data curation practices

Why curate research data? (4)

• It is increasingly a requirement of some research funding bodies
  – Some have quite mature data retention policies (not necessarily for permanent retention)
  – Increasing expectation of access to data from publicly-funded research
  – OECD Principles and guidelines for access to research data from public funding (2007)

Page 10: Current and emerging scientific data curation practices

Why curate research data? (5)

• Institutional asset management:
  – Universities and other research organisations invest very large sums of money in research activities
  – Research data is a key output of this activity
  – It is, therefore, an institutional asset that needs stewardship

Page 11: Current and emerging scientific data curation practices

Why curate research data? (6)

• Promoting the institution, research group or individual:
  – Re-use helps promote visibility and 'impact'
  – Institutions become acknowledged 'centres of competence'

Page 12: Current and emerging scientific data curation practices

Scale and complexity (1)

• Scale (1):
  – The “digital deluge”
    • e-Science
    • New generations of instruments
    • Computer simulation
    • Many terabytes generated per day, petabyte-scale computing (and growing)

Page 13: Current and emerging scientific data curation practices

Scale and complexity (2)

• Scale (2):
  – Problems of scale are particularly acute in traditional 'big-science' disciplines:
    • Particle physics (e.g., Large Hadron Collider)
    • Astronomy (sky surveys, etc.)
  – Also increasingly important in:
    • Bioinformatics, crystallography, engineering design, and many others
  – May be cheaper just to generate the data again, e.g. for gene sequencing

Page 14: Current and emerging scientific data curation practices

Scale and complexity (3)

• Complexity (1):
  – Research data is extremely diverse - not really a single category of material
    • tabular data, images, GIS, etc.
    • raw machine output vs. derived data
    • varying levels of structure (XML, legacy formats, etc.)
    • many different standards
  – Research data is not homogeneous
  – No one-size-fits-all approach possible

Page 15: Current and emerging scientific data curation practices

Scale and complexity (4)

• Complexity (2):
  – Even wider range of social contexts in which data is used (and shared)
  – DCC SCARP project has been exploring disciplinary factors in curation practice
    • Practice even within single disciplines is very fragmented
    • Case studies ongoing
      – Big-science archives, medical and social sciences, architecture and engineering, biological images

Page 16: Current and emerging scientific data curation practices

Diversity of contexts

• Research cultures
  – Data practices vary widely, even within a single discipline
    • Gene sequence data is typically deposited in public databases
    • In proteomics sharing is not so widespread; partly driven by lack of standards, but also about who has exploitation rights
  – Role of commercial interests
    • Pharmaceuticals, architecture and engineering, geological prospecting

Page 17: Current and emerging scientific data curation practices

Costs

• Recent JISC study (2008):
  – Focused on the institution level
  – Some findings:
    • The complex service requirements for curating research data mean that institutions are setting up federated approaches to repository development
    • Currently ingest costs are much higher than long-term storage and preservation costs
    • Start-up (and R&D) costs are high, but there can be economies of scale

Page 18: Current and emerging scientific data curation practices

Research data collections (1)

• A typology (1):
  – From National Science Board report Long-lived digital data collections (2005)
    • Research data collections – the products of one or more focused research projects
    • Resource or community data collections – collections that emerge to serve particular subject sub-disciplines
    • Reference data collections – serve a broader and more diverse set of user communities

Page 19: Current and emerging scientific data curation practices

Research data collections (2)

• A typology (2)
  – Research data collections – the products of one or more focused research projects
    • Extremely diverse
    • Have small user communities
    • Inconsistent standardisation
    • Typically no funding available to support the collection beyond the project funding cycle

Page 20: Current and emerging scientific data curation practices

Research data collections (3)

• A typology (3)
  – Resource or community data collections – collections that emerge to serve particular subject sub-disciplines
    • Often establish community-level standards
    • In many cases supported by funding bodies or particular research institutions

Page 21: Current and emerging scientific data curation practices

Research data collections (4)

• A typology (4)
  – Reference data collections – collections that serve a broader and more diverse set of user communities
    • Conformance to robust, well-established standards is essential
    • Expensive (time and money)
    • Budget typically comes from multiple sources, with an expectation that collections will persist

Page 22: Current and emerging scientific data curation practices

Research data collections (5)

• Data at risk
  – Data in “research data collections” is most at risk
    • A modern version of the “file-drawer problem”
    • Data stored on personal hard-drives or on media; largely undocumented (cf. DAF)
    • Particular challenge when the data creator has retired or moved to another institution
    • Data creators not always aware of its value
    • The reward structure of science is not helpful

Page 23: Current and emerging scientific data curation practices

Research data collections (6)

• Collections can evolve:
  – For example, Protein Data Bank (PDB)
    • Launched 1971, small-scale, focused on a limited set of biological structures
    • Now the main source of experimental structural information on biological macromolecules
  – How do we recognise research data collections that have the potential to evolve into reference collections?

Page 24: Current and emerging scientific data curation practices

Roles and responsibilities (1)

• Long-lived data collections (NSB)
  – Data authors
  – Data managers
  – Data scientists
  – Data users
  – Funding agencies

• Dealing with data (JISC)
  – Scientist
  – Institution
  – Data centre
  – User
  – Funder
  – Publisher

Page 25: Current and emerging scientific data curation practices

Roles and responsibilities (2)

• Scientists
  – Initial creation and use of data
  – Expectation of first use and of gaining appropriate credit and recognition
  – Responsible for:
    • Managing data for the life of the project
    • Using standards (where possible)
    • Complying with data policies
    • Making the data available in a form that can (easily?) be used by others

Page 26: Current and emerging scientific data curation practices

Roles and responsibilities (3)

• Institutions:
  – Role less clear
  – Institutional policies may require short-term management of data
    • Advocacy and training
  – Some institutions are developing repository services
    • These are currently rarely used for research data
    • Federated approaches maintain disciplinary involvement

Page 27: Current and emerging scientific data curation practices

Roles and responsibilities (3)

• Data centres
  – Undertake curation and provide access
  – Responsible for:
    • Selection and ingest
    • Participating in the development of standards
    • Protecting the rights of data creators
    • Supporting ingest and metadata capture
    • Supporting re-use (tools and services)
    • Training

Page 28: Current and emerging scientific data curation practices

Roles and responsibilities (4)

• Users:
  – Users of third-party data
  – Responsible for:
    • Adhering to any licenses and restrictions on use
    • Acknowledging data creators and curators
    • Managing any derived data
    • Providing feedback to scientists and data centres

Page 29: Current and emerging scientific data curation practices

Roles and responsibilities (5)

• Funding bodies:
  – Acting at policy level
  – Responsible for:
    • Considering wider policy perspectives
    • Developing policies in co-operation with other stakeholders
    • Monitoring and enforcing data policies
    • Support for long-term data management
    • Support for data curation

Page 30: Current and emerging scientific data curation practices

Curation infrastructures (1)

• Focus on the generic:
  – Need for a balance between:
    • The 'bottom-up' discipline-based drivers that promote the generation of research data
    • The policy level, looking to make cost-effective investment in curation
  – When building infrastructures, focus on the generic
    • Storage systems and middleware
    • Identifying the needs of the wider community

Page 31: Current and emerging scientific data curation practices

Curation infrastructures (2)

• The need for collaboration:
  – Need for 'deep-infrastructure' recognised as far back as 1996 by the Task Force on Archiving of Digital Information
  – Digital preservation involves the "grander problem of organizing ourselves over time and as a society ... [to manoeuvre] effectively in a digital landscape" (p. 7)

Page 32: Current and emerging scientific data curation practices

Collaboration on curation (1)

• Collaboration in science:
  – Collaboration is deeply embedded in some (but not all) research cultures
  – Research collaboration is a well-established phenomenon that has been studied by sociologists of science (and others)
  – The nature of collaboration differs markedly between academic disciplines

Page 33: Current and emerging scientific data curation practices

Collaboration on curation (2)

• Scientific collaboration types:
  – Informal social networks
    • Help to define disciplinary norms and interpretational paradigms
  – Formalised, semi-permanent organisations
    • Traditionally most common in "big-science" domains
    • The growth of e-science has emphasised the collaborative nature of research

Page 34: Current and emerging scientific data curation practices

Collaboration on curation (4)

• Implications for curation:
  – Collaborative data curation facilities might emerge first in sub-disciplines that have a more participatory collaboration pattern or otherwise have a strong emphasis on data sharing
  – Need for more systematic research into this across all research domains
    • Building on DCC SCARP

Page 35: Current and emerging scientific data curation practices

Collaboration on curation (5)

• Collaboration is:
  – Currently focused at disciplinary or sub-disciplinary levels
  – Embedded within the workflows of particular research communities (e.g., genomics, crystallography, astronomy)
  – Able to take advantage of the specialised knowledge available within particular "designated communities"

Page 36: Current and emerging scientific data curation practices

Collaboration on curation (6)

• Collaboration and standards:
  – Common standards emerge where there is a recognised need for data sharing
  – The existence of common standards makes data centres and repositories viable

Page 37: Current and emerging scientific data curation practices

Collaboration on curation (7)

• Interdisciplinary collaboration:
  – Previously little demand for collaboration on data curation across disciplinary borders
  – But the fundamentally collaborative nature of e-research should make us challenge this:
    • A need to pool resources and expertise
    • A need for supporting infrastructures

Page 38: Current and emerging scientific data curation practices

Collaboration on curation (8)

• Need for strategic alliances
  – National initiatives, e.g. DPC, NDIIPP, nestor
  – European Alliance for Permanent Access

Page 39: Current and emerging scientific data curation practices

Open questions (1)

• The role of institutions
  – Universities are setting up repositories
  – Rhetoric suggests that they aim to manage all research outputs (i.e. including data)
  – In practice, they currently mostly deal with research papers
  – What is the role of the institution with regard to research data?
  – Do they have the trust of researchers?

Page 40: Current and emerging scientific data curation practices

Open questions (2)

• Are generic approaches possible?
  – There is a tension between the diversity and complexity of research data and the need for generic solutions
    • Promoting data sharing between disciplines
    • Interoperability

Page 41: Current and emerging scientific data curation practices

Open questions (3)

• Data can only exist as part of wider research contexts
  – They are referenced in papers and other forms of research communication, in project documentation and archives
  – Linked from project Web pages, etc.
  – How do we ensure that curated data remains integrated within this scholarly web?
  – How do we make the links persistent?

Page 42: Current and emerging scientific data curation practices

Summing-up

• Size and diversity
  – Research data is extremely diverse
  – No one-size-fits-all solution
  – Scale is a growing problem

• Infrastructures:
  – Many data curation services already exist – good practice
  – Need to integrate these (and institutional initiatives) at the policy level

Page 43: Current and emerging scientific data curation practices

Further reading

• Neil Beagrie, Julia Chruszcz, and Brian Lavoie, Keeping research data safe: a cost model and guidance for UK universities (JISC, 2008)

• Liz Lyon, Dealing with data: roles, rights, responsibilities and relationships (JISC, 2007)

• National Science Board, Long-lived digital data collections: enabling research and education in the 21st century (NSF, 2005)

Page 44: Current and emerging scientific data curation practices

                                                             


Exercise

Page 45: Current and emerging scientific data curation practices

Exercise (1)

• 4 Scenarios:
  – A research team in 2028 is evaluating a particular set of content for use in a particular project (Web content, multimedia, images, dataset)
  – Ask questions about what they would need to know to interpret the content correctly
  – Evaluate the relative importance of: content, context, appearance, structure, behaviour

Page 46: Current and emerging scientific data curation practices

                                                             


Exercise (2)

• Scenario 1: A research project in 2028 is trying to explore how the first generation Internet was used by European political parties in the 1990s to promote citizen participation in policy formation. The investigators know that a large amount of Web material from this period is held by an organisation called the Internet Archive, and they have begun to use data mining tools to explore the extent of their holdings. What will they need to know about the collection in order to be able to do their work properly?

Page 47: Current and emerging scientific data curation practices

Exercise (3)

• Scenario 2: Art curators in 2028 are trying to put together an exhibition of digital art in a public gallery. They have found that a university art department retains a collection of digital art resources (chiefly multimedia) produced by its undergraduate students between 2000 and 2005, some of whom have gone on to become extremely important figures in the art establishment. When evaluating the collection for use in the exhibition, what would they consider to be the most important object characteristics?

Page 48: Current and emerging scientific data curation practices

Exercise (4)

• Scenario 3: Healthcare researchers in 2028 are trying to trace the historical incidence of certain lung abnormalities and have access to a massive database of medical images (X-rays) that they intend to submit to the most up-to-date content-based image retrieval techniques. The database is made up of imaging output from more than one hospital and the researchers are worried that certain parameters essential to their research (e.g., the age and sex of the patient, imaging dates) may be missing. What else do they need to know about the database before they can start running their search algorithms?

Page 49: Current and emerging scientific data curation practices

                                                             


Exercise (5)

• Scenario 4: A research project in 2028 is trying to find links between climate records and biological species diversity in south-west England. The principal investigator has found a promising dataset of geographically-relevant biodiversity information in a local history museum. What more does she need to know about this dataset before she can get her team to try to integrate this dataset (and others like it) with historical climate models?

Page 50: Current and emerging scientific data curation practices

                                                             


Acknowledgements

The Digital Curation Centre is funded by the JISC and the UK Research Councils' e-Science Core Programme.

http://www.dcc.ac.uk/

UKOLN is funded by the Museums, Libraries and Archives Council, the Joint Information Systems Committee (JISC) of the UK higher and further education funding councils, as well as by project funding from the JISC, the European Union, and other sources. UKOLN also receives support from the University of Bath, where it is based.

http://www.ukoln.ac.uk/

