Data Selection & Triage
JISC/DCC Progress Workshop: Managing Research Data & Institutional Engagement
Nottingham, 25 October 2012
This work is licensed under a Creative Commons Attribution 2.5 UK: Scotland License
Introduction
How can researchers and support staff effectively decide what data is worth holding on to, agree what to do with it, and arrange for its handover?
What challenges does this present?
How to address them?
Outline
• What guidelines are there and why do we need more? - Angus Whyte, DCC and Marie Therese Gramstadt, KAPTUR
• UK Data Archive's Data Review Process - Veerle Van den Eynden, UKDA
• Applying NERC's Data Value Checklist - Sam Pepler, British Atmospheric Data Centre
• Discussion
Guidelines clarify expectations
What criteria will be used to judge what’s handed over?
…adapted by Archaeology Data Service, NERC, KAPTUR, University of Leicester
Basic model
1. Define a policy i.e. criteria and range of decisions
2. Archive manager applies criteria, involving researchers
3. Select the significant, dispose of the rest
For records yes, but research data?
All data
10%
90%
Characterising research data…
• Research process more uncertain and open-ended than admin processes
• Research data purpose may change before complete
• More effort to make reusable - complex inter-relationships, and richer contexts to document
• Originators should be engaged but may not have capacity e.g. if project funding has ceased
• Others may need to be involved with broader view of potential in other disciplines
• More than a keep/dispose choice: need to prioritise attention and effort to make data fit for reuse
Triage analogy
Criteria
Duty of care
Reuse value
Quality and condition
Accessibility
Costs associated
Prioritise
High reuse value + needs attention, affordable
Other permutations
More permutations
Low reuse value, unaffordable
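The prioritisation above can be sketched as a small decision function. This is only an illustration under stated assumptions: the `Dataset` fields, the 0 to 3 reuse scale, the threshold and the outcome labels are all hypothetical and are not taken from any of the guidelines mentioned.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    name: str
    duty_of_care: bool     # legal, funder or ethical obligation to retain
    reuse_value: int       # hypothetical scale: 0 (none) to 3 (high)
    needs_attention: bool  # quality, condition or accessibility work required
    affordable: bool       # associated costs fit the available budget


def triage(ds: Dataset) -> str:
    """Return a hypothetical triage outcome for one dataset."""
    if ds.duty_of_care:
        # An obligation to retain overrides the other criteria.
        return "retain"
    high_value = ds.reuse_value >= 2  # assumed threshold
    if high_value and ds.affordable:
        # High reuse value + needs attention + affordable: first in the queue.
        return "prioritise" if ds.needs_attention else "retain"
    if not high_value and not ds.affordable:
        # Low reuse value, unaffordable.
        return "dispose"
    # Every other permutation goes to a human for a closer look.
    return "review"


survey = Dataset("field survey", False, 3, True, True)
print(triage(survey))
```

The point of the sketch is that only the two extreme permutations get an automatic answer; everything in between is routed to people.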
Tiered approach to deploying resources
Discoverability
Access management
Storage performance
Preservation actions
Deposit location
Institutional Data Repository
Data Centre
Subject Repository etc.
Potential to automate?
First characterise research data
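If tier assignment were automated, one option is a lookup from a priority level to a level of service. The sketch below assumes three priority levels; the tier names and every service description in it are hypothetical placeholders, not an agreed model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceTier:
    discoverability: str
    access_management: str
    storage_performance: str
    preservation_actions: str
    deposit_location: str


# Hypothetical tiers: an institution would define its own levels and wording.
TIERS = {
    "high": ServiceTier(
        discoverability="full catalogue record",
        access_management="managed access requests",
        storage_performance="online storage",
        preservation_actions="active preservation",
        deposit_location="data centre or subject repository",
    ),
    "medium": ServiceTier(
        discoverability="basic catalogue record",
        access_management="access on request",
        storage_performance="nearline storage",
        preservation_actions="integrity checks only",
        deposit_location="institutional data repository",
    ),
    "low": ServiceTier(
        discoverability="internal listing only",
        access_management="originator only",
        storage_performance="offline storage",
        preservation_actions="none, reviewed at a set date",
        deposit_location="institutional data repository",
    ),
}


def tier_for(priority: str) -> ServiceTier:
    """Look up the service tier for a priority level ('high', 'medium', 'low')."""
    return TIERS[priority]


print(tier_for("medium").deposit_location)
```

A lookup like this only works once datasets have been characterised well enough to be given a priority in the first place.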
Clarify expectations
What kinds of “data” are wanted?
For what kinds of reuse?
e.g. Data Centre Collection Policies
http://archaeologydataservice.ac.uk/advice/collectionsPolicy
“The ADS expects to collect all of the following archaeological data types…”
Costs should persuade us
IDC Digital Universe Study: increasing volumes outpace declining storage hardware costs
Source: John Gantz and David Reinsel (2011), Extracting Value from Chaos, http://www.emc.com/digital_universe
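The argument that volumes outpace falling hardware costs is simple compounding, and can be sketched as below. The growth and decline rates are hypothetical placeholders chosen for illustration; they are not figures from the IDC study.

```python
def storage_spend(years, volume_growth=0.5, cost_decline=0.25,
                  volume0=1.0, unit_cost0=1.0):
    """Relative yearly storage spend under compound growth and decline.

    All default rates are hypothetical, not taken from any study.
    """
    spend = []
    for t in range(years + 1):
        volume = volume0 * (1 + volume_growth) ** t      # data kept in year t
        unit_cost = unit_cost0 * (1 - cost_decline) ** t  # cost per unit in year t
        spend.append(volume * unit_cost)
    return spend


# Spend rises whenever (1 + volume_growth) * (1 - cost_decline) > 1.
for year, cost in enumerate(storage_spend(5)):
    print(year, round(cost, 3))
```

With these assumed rates the yearly multiplier is 1.5 × 0.75 = 1.125, so spend rises every year even though unit cost falls by a quarter each year.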
We can’t afford it all
“Keeping 2018’s data in S3 would cost the entire global GDP”
http://blog.dshr.org/2012/05/lets-just-keep-everything-forever-in.html
Selection presumes description
• You can’t value what you don’t know about!
• Researchers can’t afford NOT to spend effort on minimal metadata description and organisation, because costs of retention will be much higher if they don’t
• Description makes data affordable – is citation potential a concrete enough reward?
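As a rough illustration of "selection presumes description", a check like the one below could flag datasets that cannot yet be appraised. The list of required fields is an assumption made for this sketch, not a recommended metadata standard.

```python
# Hypothetical minimal description needed before a dataset can be valued.
REQUIRED_FIELDS = ("title", "creator", "description", "location", "date_created")


def missing_fields(record: dict) -> list:
    """Return the required fields that are absent or blank in a record."""
    return [
        field for field in REQUIRED_FIELDS
        if not str(record.get(field) or "").strip()
    ]


def can_be_appraised(record: dict) -> bool:
    """A dataset can only be valued once its minimal description is complete."""
    return not missing_fields(record)


record = {"title": "Field survey 2011", "creator": "A. Researcher"}
print(missing_fields(record))
print(can_be_appraised(record))
```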
Challenges
• Identify what datasets are created and where they are
• Differentiate those that are of high value from those with most uncertainty or least reusability
• Be able to justify ‘natural’ wastage of low priority data as much as deliberate selection of high value
Questions
• What has worked/is working?
• What lessons have you learned and how generalisable are they?
• What challenges remain?
• How may they be approached and what do you intend to do?
• What DCC / MRD activity do you think may help make the challenge more tractable?