Post on 28-Mar-2015
© S.J. Coles 2006
Enabling the reusability of scientific data: Experiences with designing an open access infrastructure for sharing datasets
Simon J. Coles
EPSRC National Crystallography Service
School of Chemistry
University of Southampton
Data & the Publication Problem
25,000,000
2,000,000
450,000
A Different Approach to Data Publication?
Underlying data
Intellect & Interpretation
Requirements
• Capture of all digital data and information generated during the course of an experiment
• Data validation
• Adding value
• Archival system for data with attached bibliographic and chemical metadata
• Automatic report generation
• Schema and protocols for publication and dissemination of a dataset
Open Access Crystal Structure Archive
ecrystals.chem.soton.ac.uk
Access to the Underlying Data
Publicising Content
Harvesting, Linking and Aggregating
Usability: Quality & Uniformity of data
• Different laboratories, practices & instruments present a heterogeneous body of data
• Publish according to IUCr-ratified schema
• To support publication according to this schema, a toolbox add-on to the archive has been developed
• Toolbox requires only 2 mandatory files and can perform file format conversions and generate value-added files
Usability: Ease of Deposition & Metadata Quality
• Minimal number of manual metadata entries – many can be hardwired into the system
• Deposition guidelines initially prepared by students to provide impartial feedback
• Full documentation and in-line help/examples
• Restricted lists, e.g. keywords
• Data deposited automatically by toolbox
• Automated generation of metadata for report and OAI interface
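The two metadata ideas on this slide (hardwired entries and restricted lists) can be sketched as follows. This is an illustration only: the function name, the hardwired values and the keyword list are all hypothetical and are not taken from the archive software.

```python
# Hypothetical sketch; none of these names or values come from the real archive.

# Entries that never change for a given laboratory are hardwired once.
HARDWIRED = {
    "publisher": "Example Crystallography Laboratory",  # assumed value
    "rights": "Open access",                            # assumed value
}

# A restricted keyword list keeps free-text variation out of the metadata.
ALLOWED_KEYWORDS = {"organic", "inorganic", "organometallic"}  # assumed list


def build_metadata(title, creators, keywords):
    """Merge the few manual entries with the hardwired defaults."""
    rejected = sorted(set(keywords) - ALLOWED_KEYWORDS)
    if rejected:
        raise ValueError(f"keywords not in restricted list: {rejected}")
    record = dict(HARDWIRED)
    record.update({
        "title": title,
        "creators": list(creators),
        "keywords": sorted(set(keywords)),
    })
    return record
```

The depositor supplies only title, creators and keywords. Anything outside the restricted list is rejected at deposit time rather than cleaned up later.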
Usability: Data Validation
• Peer review removed from self deposit publication
• Simple checks for consistency made by the toolbox
• Checks for crystallographic integrity made through a web service (IUCr, ‘CHECKCIF’)
• Introduction of data ‘editor’ for the archive; a deposition must be signed-off by a recognised professional before going live
• Quality indicators automatically taken from dataset and presented in HTML jump-off page
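A rough sketch of what a simple consistency check and the extraction of a quality indicator might look like is given below. The data names are real CIF core items, but which items the toolbox actually checks is not stated on the slide, so the selection here is an assumption. The parser handles only single-line "name value" pairs and is not a full CIF reader.

```python
import re

# Real CIF core data names; the choice of which to require is an assumption.
REQUIRED_ITEMS = ("_cell_length_a", "_cell_length_b", "_cell_length_c")
R_FACTOR_ITEM = "_refine_ls_R_factor_gt"


def parse_simple_items(text):
    """Read single-line 'name value' pairs; loops and text blocks are ignored."""
    items = {}
    for line in text.splitlines():
        parts = line.strip().split(None, 1)
        if len(parts) == 2 and parts[0].startswith("_"):
            items[parts[0]] = parts[1].strip()
    return items


def to_number(value):
    """Drop a trailing standard uncertainty such as '(2)' and convert to float."""
    return float(re.sub(r"\(\d+\)$", "", value))


def consistency_report(text):
    """Return (problems, quality_indicators) for a CIF-like text."""
    items = parse_simple_items(text)
    problems = [f"missing {name}" for name in REQUIRED_ITEMS if name not in items]
    indicators = {}
    if R_FACTOR_ITEM in items:
        r_factor = to_number(items[R_FACTOR_ITEM])
        indicators["R factor"] = r_factor
        if not 0.0 <= r_factor <= 1.0:
            problems.append("R factor outside 0..1")
    else:
        problems.append(f"missing {R_FACTOR_ITEM}")
    return problems, indicators
```

The list of problems could block a deposit before it reaches the data editor, while the indicators are the kind of values that could be shown on the jump-off page. Crystallographic integrity itself is left to the CHECKCIF service.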
Usability: Identifiers
• URL of deposited dataset provides an identifier
• Persistent only if the Institutional support model is accepted / adopted
• Signed-up to an agency to register metadata relating to datasets with a DOI
• Pay registry to ensure that DOI always resolves to associated dataset (10 cents to register, 1 cent per annum to maintain)
• InChI chemical identifier - a unique text descriptor for a molecule
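The arithmetic behind the DOI fees, and the shape of the two identifiers, can be sketched as below. The fee values are the ones quoted on the slide. The function names are hypothetical, the resolver address is the public doi.org proxy, and the InChI test is a cheap syntactic check that assumes a version 1 identifier and does not validate the chemistry.

```python
from urllib.parse import quote

REGISTER_CENTS = 10   # one-off registration fee quoted on the slide
MAINTAIN_CENTS = 1    # annual maintenance fee quoted on the slide


def doi_cost_cents(datasets, years):
    """Total cost in cents of registering DOIs and maintaining them."""
    return datasets * (REGISTER_CENTS + MAINTAIN_CENTS * years)


def doi_url(doi):
    """Resolver URL for a DOI via the public doi.org proxy."""
    return "https://doi.org/" + quote(doi, safe="/")


def looks_like_inchi(text):
    """Syntactic check only: version 1 prefix plus at least one layer separator."""
    return text.startswith("InChI=1") and "/" in text
```

On these figures, 1,000 datasets kept for 10 years cost 1,000 × (10 + 10) = 20,000 cents. An InChI such as `InChI=1S/H2O/h1H2` (water) passes the check, while a plain name or formula does not.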
Usability: Dissemination & Aggregation
• OAI metadata schema; ratified by IUCr & chemical community
• OAI covers bibliographic terms; must introduce chemical terms
• Both library and subject specific aggregators satisfied
• Chemical linking; InChI, chemical classifications and restricted keywords list
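As an illustration of how an aggregator might harvest such an archive, the sketch below builds a standard OAI-PMH `ListRecords` request and reads Dublin Core fields from a response. The sample record, the example.org address and the placement of an InChI in `dc:identifier` are assumptions made for this sketch. The slide does not describe the actual schema, which introduces chemical terms beyond the bibliographic ones shown here.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build an OAI-PMH ListRecords request URL for a repository endpoint."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return base_url + "?" + query


# Hypothetical response; a real harvester would fetch this over HTTP.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example crystal structure</dc:title>
          <dc:creator>A. Depositor</dc:creator>
          <dc:identifier>InChI=1S/H2O/h1H2</dc:identifier>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def parse_records(xml_text):
    """Extract the record id, title and identifiers from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        records.append({
            "id": rec.findtext("oai:header/oai:identifier", namespaces=NS),
            "title": rec.findtext(".//dc:title", namespaces=NS),
            "identifiers": [e.text for e in rec.iterfind(".//dc:identifier", NS)],
        })
    return records
```

A library aggregator can stop at the bibliographic fields, while a subject-specific aggregator can go on to link records that share a chemical identifier.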
Usability: Endorsement
• Feedback during development from technical publishing arm of IUCr
• Designed for automatic incorporation into CSD (global database operated by CCDC)
• Accepted by Executive Committee of IUCr
• Reuse of data achieved in collaboration with Leverhulme Centre for Molecular Informatics
Usability: Community Uptake
• Southampton archive about to publish routinely via the archive
• Five crystallography laboratories in UK agreed to adopt philosophy, install and populate archives
• CCDC will harvest required data from all archives
• IUCr will harvest and curate all data
• Develop aggregator services in collaboration with IUCr
Usability: The Next Challenges
• Full acceptance by chemical community
– Validation worries
– Curation worries
– The requirement for as many peer reviewed publications as possible (despite quality)
• Full acceptance by wider chemistry publishing community
– Loss of control over underlying data
– Faith in Open Archives replacing experimental descriptions in articles
• Development of fully functional aggregator services