(Linked) Data Curation challenges
Kevin Ashley
Director, Digital Curation Centre
Reusable with attribution: CC-BY
The DCC is supported by Jisc
Acknowledgements
• John Wilkins & Cameron Neylon
• Ideas, images, slides, inspiration
2013-07-05 Kevin Ashley – CC-BY
Data views and processes
• Administration
• Discovery
• Work-level description
• Discipline-level interpretation
Administrative view
Data from projects funded by NERC
Data produced by the department of linguistics
Discovery view
Data about reproductive behaviour in freshwater fish
Data is variable
• Not always textual
• Not always tabular
• Not always fixed
• Not always clearly authored – think of archival provenance
• Not always associated with publication
http://www.flickr.com/photos/sethw/113073189/
95% of research results are never published
http://flickr.com/photos/heymans/480396810/
If a million postdocs repeat a million experiments…
http://flickr.com/photos/cliche/120070310/
And 25% of those don’t work…
…how much taxpayers’ money is that?
http://flickr.com/photos/luismimunoznajar/2093185804/
I need that data now!!! I don’t care how messy it is – I can fix it!
I’ve wasted too much of my life fixing other people’s bad data. I’m not interested until you’ve cleaned it up and documented it. Besides, I have other things to think about.
Grandfather’s axe
[email protected] CC-BY-NC-SA
When is my dataset a new dataset?
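One way to make the question concrete is to compare content fingerprints. The sketch below is only an illustration under stated assumptions: the function names `fingerprint` and `is_new_version` are hypothetical, a dataset is assumed to be a list of simple records, and "any change in content means a new version" is one deliberately strict policy among many, not an answer to the question on the slide.

```python
import hashlib
import json


def fingerprint(rows):
    """Return a SHA-256 digest of the rows in a canonical JSON form.

    Assumption: rows is a list of dicts holding JSON-serialisable values.
    Sorting the keys means that reordering fields does not change the digest.
    """
    canonical = json.dumps(rows, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_new_version(old_rows, new_rows):
    """Strict policy (an assumption): any content change is a new version."""
    return fingerprint(old_rows) != fingerprint(new_rows)


# Hypothetical records, invented purely for illustration.
original = [{"site": "A", "count": 3}]
reordered = [{"count": 3, "site": "A"}]   # same content, different key order
corrected = [{"site": "A", "count": 4}]   # one value fixed during cleaning
```

Under this policy `reordered` counts as the same dataset and `corrected` as a new one. Like the axe, a dataset cleaned one cell at a time would yield a long chain of versions, so a curator may prefer a looser rule.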
Authorship
• Reference data – cell-level provenance versus single author data table
• ‘Cleaned’ data – can pass through many hands
• Synthesis…
Potential wins
• Provenance of machine-gathered data – linking observations to instrument descriptions
• Linking data in multiple places
• Data and publications and plans
• Robust assertions about data versioning
• Association of data with institutions
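The first of these wins, linking observations to instrument descriptions, can be sketched in a few lines. This is a minimal illustration, not a real provenance model: the classes `Instrument` and `Observation`, the identifier `inst:thermo-01`, and the sample values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instrument:
    id: str            # identifier that observations point at
    description: str   # human-readable instrument description


@dataclass(frozen=True)
class Observation:
    value: float
    unit: str
    instrument_id: str  # link to an Instrument, held as an identifier


# Hypothetical registry of instrument descriptions.
INSTRUMENTS = {
    "inst:thermo-01": Instrument(
        "inst:thermo-01", "Hypothetical water temperature probe"
    ),
}


def describe(obs, instruments):
    """Follow the link from an observation to its instrument description."""
    inst = instruments.get(obs.instrument_id)
    if inst is None:
        return f"{obs.value} {obs.unit} (instrument unknown)"
    return f"{obs.value} {obs.unit} measured by {inst.description}"


reading = Observation(12.5, "degC", "inst:thermo-01")
print(describe(reading, INSTRUMENTS))
```

Because the observation stores only an identifier, the instrument description can live elsewhere and be corrected or enriched without touching the data.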
More wins
• Assertions at table and variable group level
• Linking that crosses disciplinary boundaries:
  – Biochemistry and neuroscience
  – Naval history, economics and climate science
• Linking that crosses research and administrative boundaries
IGFBP-5 plays a role in the regulation of cellular senescence via a p53-dependent pathway and in aging-associated vascular diseases
After John Wilbanks
Tylenol
SameAs:
• N-acetyl-p-aminophenol
• Acetaminophen
• Paracetamol
• N-(4-hydroxyphenyl)ethanamide
• N-(4-hydroxyphenyl)acetamide
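The SameAs idea on this slide can be sketched as a small equivalence-class computation. This is an illustrative sketch only: which name is asserted equal to which is an assumption made here (the slide shows the names and the SameAs relation, not the individual pairs), and `build_clusters` and `same_thing` are hypothetical helper names.

```python
# Hypothetical pairwise SameAs assertions over the names on the slide.
SAME_AS = [
    ("Tylenol", "Acetaminophen"),
    ("Acetaminophen", "Paracetamol"),
    ("Paracetamol", "N-acetyl-p-aminophenol"),
    ("N-(4-hydroxyphenyl)ethanamide", "N-(4-hydroxyphenyl)acetamide"),
    ("Paracetamol", "N-(4-hydroxyphenyl)acetamide"),
]


def build_clusters(pairs):
    """Group names into clusters using a simple union-find over the pairs."""
    parent = {}

    def find(name):
        parent.setdefault(name, name)
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # shorten the path as we go
            name = parent[name]
        return name

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for name in list(parent):
        clusters.setdefault(find(name), set()).add(name)
    return clusters


def same_thing(a, b, pairs=SAME_AS):
    """True if a and b end up in the same cluster of SameAs assertions."""
    for cluster in build_clusters(pairs).values():
        if a in cluster:
            return b in cluster
    return a == b


print(same_thing("Tylenol", "N-(4-hydroxyphenyl)ethanamide"))
```

Each assertion links only two names, yet following the links transitively lets a search for one name find data recorded under any of the others.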
“I never had an idea that couldn’t be improved by sharing it with as many people as possible…”
Bill Hooker (2006)
http://3quarksdaily.blogs.com/3quarksdaily/2006/10/the_future_of_s_1.html