Enhancing access to research data: the challenge of crystallography
Rachel Heery, Monica Duke, Michael Day
UKOLN, University of Bath
Leslie Carr, Simon Coles
University of Southampton
www.bath.ac.uk
A centre of expertise in digital information management
JCDL 2005, June 7-11, Denver
Enhancing access to research data: overview
• Crystallography as an exemplar
• Impact of digital technologies on scientific research process
• Need new modes of data curation
• eBank project: applying digital library techniques to support data curation
• Next steps
Changes in scientific research process
• Increasing data volumes from eScience / Grid-enabled / cyber-infrastructure applications, “big science”
• Changing research methods: high throughput technologies, automation, ‘smart labs’
• Potential for re-use of data, new inter-disciplinary research
• Different types of data: observational data, experimental data, computational data: different stewardship requirements
Data Overload!
How do we disseminate?
EPSRC National Crystallography Service
The data deluge: crystallography
Data overload & the publication bottleneck
[Figure: chemical structure diagrams (atom labels Cl, O, N, N+)]
25,000,000
2,000,000
300,000
Current Publishing Process
• Journal articles: aims, ideas, context, conclusions – only most significant data
• Raw & underlying data required by peers not readily available
Context: existing data repositories
• National data archives:
– UK Data Archive, Arts and Humanities Data Service, US National Archives and Records Administration (NARA), Atlas Datastore
• Discipline specific archives:
– GenBank, Protein Data Bank
• Crystallography archives
– Cambridge Crystallographic Data Centre (Cambridge Structural Database), Indiana University Molecular Structure Center (Crystal Data Server, Reciprocal Net), FIZ Karlsruhe (Inorganic crystals), Toth Information Systems (CRYSTMET)
• Journals require deposit of data to support articles
– Typically deposit of summary data… partial coverage
Crystallography workflow
RAW DATA | DERIVED DATA | RESULTS DATA
• Initialisation: mount new sample on diffractometer & set up data collection
• Collection: collect data
• Processing: process and correct images
• Solution: solve structures
• Refinement: refine structure
• CIF: produce CIF (Crystallographic Information File)
• Validation: chemical & crystallographic checks
eBank UK project overview
• JISC funded in 2003, now in Phase 2 to 2006
• Joint effort between crystallographers, computer scientists, digital library researchers
• Investigating contribution of existing digital library technologies to enable ‘publication at source’
• Partners have interest in dissemination of chemistry research data, open access, OAI, institutional repositories
http://www.ukoln.ac.uk/projects/ebank-uk/
eBank project team
University of Bath, UKOLN
• Michael Day, Monica Duke, Rachel Heery, Liz Lyon, Traugott Koch
University of Southampton, School of Chemistry
• Simon Coles, Jeremy Frey, Mike Hursthouse
University of Southampton, School of Electronics and Computer Science
• Leslie Carr, Chris Gutteridge
University of Manchester, PSIgate
• John Blunden-Ellis
eBank phase one: achievements
• Gathered requirements from crystallographers
• Established pilot institutional repository for crystallography data at Southampton with web interface
• Developed a demonstrator aggregator service at UKOLN (CCDC exploring aggregation service)
• Developed appropriate schema
• Demonstrated a search interface as an embedded service at PSIgate portal
• Demonstrated an added value service linking research data to papers (one-off)
Institutional repositories…publication at source
• Institution establishes repository(s)
• Institution pro-actively supports deposit process
• OAI provides basis for interoperability
• Potential for added value services
• And/or … international subject based archives?
Crystallography good fit….
• Crystallography has well defined data creation workflow
• Tradition of sharing using standard file format
• Crystallographic Information File (CIF)
• What about other chemistry sub-disciplines? other scientific disciplines?
Data Flow in eBank UK
[Figure: data flow diagram. Labels: Create, Submit, Institutional repository (Store/link, Data files, Metadata, present HTML), OAI-PMH, Harvest (XML), eBank aggregator (Index and Search, present HTML)]
OAI-PMH: harvesting and aggregating
eBank aggregator at UKOLN
http://eprints-uk.rdn.ac.uk/ebank-demo/
Demonstrating potential for linking between data and journal article
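As an illustration of the harvesting step, here is a minimal Python sketch of an OAI-PMH ListRecords request and of reading record identifiers from a response. The repository host (repository.example.org) and the sample response are invented for the example and are not eBank endpoints or data; the `ebank_dc` metadata prefix is the one named on these slides. A real harvester would also need to handle resumption tokens and error responses.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespace defined by the OAI-PMH 2.0 protocol.
OAI_NS = "http://www.openarchives.org/OAI/2.0/"


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build an OAI-PMH ListRecords request URL for the given metadata format."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return base_url + "?" + query


def record_identifiers(response_xml):
    """Return the header identifier of every record in a ListRecords response."""
    root = ET.fromstring(response_xml)
    path = f".//{{{OAI_NS}}}record/{{{OAI_NS}}}header/{{{OAI_NS}}}identifier"
    return [element.text for element in root.findall(path)]


# Hypothetical, heavily trimmed response used only to exercise the parser.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:repository.example.org:1</identifier></header></record>
    <record><header><identifier>oai:repository.example.org:2</identifier></header></record>
  </ListRecords>
</OAI-PMH>"""

if __name__ == "__main__":
    # Hypothetical base URL; substitute the repository's real OAI-PMH endpoint.
    print(list_records_url("http://repository.example.org/oai", "ebank_dc"))
    print(record_identifiers(SAMPLE_RESPONSE))
```

An aggregator would fetch the URL built by `list_records_url`, pass the response body to a parser like `record_identifiers`, and then index the harvested metadata for search.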
Schema for records made available for harvesting
• Data holding (collection of files associated with experiment)
• Qualified Dublin Core data elements plus additional chemical properties
– Empirical formula
– International Chemical Identifier (InChI)
– Compound Class
• Individual data files
• Separate records for stage status of each file
• Description set wrapped into one XML record using METS
• Research metadata/data as a complex object
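The shape of such a record can be sketched as a small Python data structure. This is an illustration only, not the eBank schema: the class and field names are hypothetical, the stage names are taken from the crystallography workflow slide, and attaching one workflow stage to each file is an assumption about what "stage status" means. The example values (water, with its standard InChI) are placeholders.

```python
from dataclasses import dataclass, field
from enum import Enum


class Stage(Enum):
    """Workflow stages as named on the crystallography workflow slide."""
    INITIALISATION = 1
    COLLECTION = 2
    PROCESSING = 3
    SOLUTION = 4
    REFINEMENT = 5
    CIF = 6
    VALIDATION = 7


@dataclass
class DataFile:
    """One file in a data holding, tagged with the stage that produced it (assumed)."""
    filename: str
    stage: Stage


@dataclass
class DataHolding:
    """Collection of files from one experiment plus descriptive metadata."""
    identifier: str
    empirical_formula: str
    inchi: str
    compound_class: str
    files: list = field(default_factory=list)

    def files_at(self, stage):
        """Return the names of all files recorded against the given stage."""
        return [f.filename for f in self.files if f.stage is stage]


# Placeholder values for illustration only.
example = DataHolding(
    identifier="oai:repository.example.org:1",
    empirical_formula="H2O",
    inchi="InChI=1S/H2O/h1H2",
    compound_class="example class",
    files=[
        DataFile("frames.raw", Stage.COLLECTION),
        DataFile("structure.cif", Stage.CIF),
    ],
)

if __name__ == "__main__":
    print(example.files_at(Stage.CIF))
```

Keeping the per-file stage next to the descriptive metadata mirrors the idea of treating research metadata and data as one complex object.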
eBank data model
[Figure: eBank data model diagram. Model input Andy Powell, UKOLN. Labels:
• Deposit into institutional repositories
• Crystal structure (data holding), Crystal structure report (HTML), Datasets
• ebank_dc record (XML), dc:type=“CrystalStructure”
• Eprint oai_dc record (XML), dc:type=“Eprint” and/or “Text”
• Eprint “jump-off” page (HTML), Eprint manifestation (e.g. PDF)
• Linking via dc:identifier, dcterms:references, dcterms:isReferencedBy
• eBank UK aggregator service, ePrint UK aggregator service, other aggregators and services
• Harvesting OAI-PMH ebank_dc; Harvesting OAI-PMH oai_dc, ebank_dc; Harvesting OAI-PMH oai_dc]
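The linking between data records and eprint records can be illustrated with a small consistency check: every dcterms:references link should be mirrored by a dcterms:isReferencedBy link on the target record. The Python sketch below is an assumption-laden illustration, not the eBank implementation. Records are modelled as plain dictionaries, the identifiers are invented, and which record type carries which term is chosen arbitrarily for the sample data.

```python
def missing_reciprocal_links(records):
    """Return (source_id, target_id) pairs whose dcterms:references link
    is not mirrored by dcterms:isReferencedBy on the target record."""
    by_id = {record["dc:identifier"]: record for record in records}
    missing = []
    for record in records:
        source_id = record["dc:identifier"]
        for target_id in record.get("dcterms:references", []):
            target = by_id.get(target_id)
            # A link is broken if the target is unknown or does not point back.
            if target is None or source_id not in target.get("dcterms:isReferencedBy", []):
                missing.append((source_id, target_id))
    return missing


# Hypothetical records; identifiers and link direction are assumptions.
RECORDS = [
    {
        "dc:identifier": "oai:data.example.org:1",
        "dc:type": "CrystalStructure",
        "dcterms:references": ["oai:eprints.example.org:42"],
    },
    {
        "dc:identifier": "oai:eprints.example.org:42",
        "dc:type": "Eprint",
        "dcterms:isReferencedBy": ["oai:data.example.org:1"],
    },
]

if __name__ == "__main__":
    print(missing_reciprocal_links(RECORDS))
```

An aggregator that harvests both record types could run a check like this before presenting links between a dataset and its journal article.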
Creating the metadata
• Potential to embed ‘deposit and disseminate’ into workflow of chemist in automated way
[Figure: data collection flowchart. Labels: Sample Tray, BruNo Mount, Setup via GUI, PreScans, Diffraction, Unit Cell, Success, Strategy, Data Collection, Data Process, BruNo Unmount, System Y; decision branches Yes/No]
eBank phase two work areas
• Sub-disciplines of chemistry and physical sciences
• Pursue generic data model
• Use of identifiers for citing datasets
• Subject approach to discovering research data
• Access to research data in teaching and learning context
• Liaise with other digital repository initiatives
For the future…
• Who provides added value services?
– Authority files, automated subject indexing, annotation, data mining, visualisation
• What are the preservation issues?
– UK Digital Curation Centre http://www.dcc.ac.uk
– National Science Board Draft report on long-lived data collections http://www.nsf.gov/nsb/meetings/2005/LLDDC_draftreport.pdf
• How to manage complex object descriptions within OAI?
• Digital curation of research data presents new roles for scientists, computer scientists, data managers…. ‘data scientists’