Building Capability for Facilities Supported Structural Science
Brian Matthews
Scientific Information Group, E-Science Centre
STFC Rutherford Appleton Laboratory
Facilities Support
Big Facilities for Small Science
DLS
ISIS
CLF
The Science we do - Structure of materials
• ~30,000 user visitors each year in Europe:
– physics, chemistry, biology, medicine
– energy, environmental, materials, culture
– pharmaceuticals, petrochemicals, microelectronics
Fitting experimental data to model
Bioactive glass for bone growth
Structure of cholesterol in crude oil
Hydrogen storage for zero emission vehicles
Magnetic moments in electronic storage
Longitudinal strain in aircraft wing
Diffraction pattern from sample
Visit facility on research campus
Place sample in beam
• Billions of € of investment
– c. £400M for DLS
– + running costs
• Over 5,000 high impact publications per year in Europe
– But so far no integrated data repositories
– Lacking sustainability & traceability
I2S2 - Infrastructure for Integration in Structural Sciences
Bridging the gap between raw and derived data
“Lone” researcher scenario
• data sharing with colleagues via email
• Little or no infrastructure
• Little management of raw or derived data
EPSRC National Crystallography Service
• service provision function
• operates across institutions
• moderate infrastructure
Diamond & ISIS
• operates on behalf of multiple institutions
• processes for experiments
• large infrastructure engineered to manage raw data
• derived data taken off site on laptops / removable drives
Facilities Lifecycle
Proposal
Approval
Scheduling
Experiment
Data cleansing
Record Publication
Scientist submits application for beamtime
Facility committee approves application
Facility registers, trains, and schedules scientist’s visit
Scientist visits, facility runs experiment
Subsequent publication registered with facility
Raw data filtered and cleansed
Data analysis
Tools for processing made available
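The lifecycle stages above can be sketched as an ordered enumeration. This is only an illustration: the names `Stage` and `next_stage` are hypothetical, and placing data analysis between data cleansing and record publication is an assumption about the diagram, not something the slide states.

```python
from enum import Enum


class Stage(Enum):
    """Facility lifecycle stages (ordering partly assumed, see lead-in)."""
    PROPOSAL = 1
    APPROVAL = 2
    SCHEDULING = 3
    EXPERIMENT = 4
    DATA_CLEANSING = 5
    DATA_ANALYSIS = 6
    RECORD_PUBLICATION = 7


def next_stage(stage):
    """Return the stage that follows `stage`, or None after the last one."""
    members = list(Stage)  # Enum members iterate in definition order
    index = members.index(stage)
    return members[index + 1] if index + 1 < len(members) else None


if __name__ == "__main__":
    stage = Stage.PROPOSAL
    while stage is not None:
        print(stage.name)
        stage = next_stage(stage)
```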
Metadata Repository
Investigation
Publication
Keyword
Topic
Sample
Sample Parameter
Dataset
Dataset Parameter
Datafile
Datafile Parameter
Investigator
Related Datafile
Parameter
Authorisation
Core Scientific Metadata Model (CSMD)
A common general format for Scientific Studies and data holdings metadata
• Cataloguing data holdings
• Related to the experiment
• Provide access for the Data Owner
• Ease citation, sharing, collaboration, and integration
• Allow easy Federation of distributed metadata into a homogeneous Platform
The Core Metadata model forms the information model for ICAT.
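As a rough illustration of the hierarchy named in the metadata diagram (Investigation, Dataset, Datafile, each with Parameters), the sketch below uses Python dataclasses. The field names and the helper `list_datafiles` are assumptions made for the example; they are not the actual CSMD or ICAT schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Parameter:
    # Hypothetical fields: a named value with optional units
    name: str
    value: str
    units: str = ""


@dataclass
class Datafile:
    name: str
    location: str
    parameters: List[Parameter] = field(default_factory=list)


@dataclass
class Dataset:
    name: str
    datafiles: List[Datafile] = field(default_factory=list)
    parameters: List[Parameter] = field(default_factory=list)


@dataclass
class Investigation:
    title: str
    investigators: List[str] = field(default_factory=list)
    keywords: List[str] = field(default_factory=list)
    datasets: List[Dataset] = field(default_factory=list)


def list_datafiles(investigation):
    """Return the names of all datafiles catalogued under an investigation."""
    return [f.name for ds in investigation.datasets for f in ds.datafiles]


if __name__ == "__main__":
    raw = Dataset("raw", [Datafile("run001.nxs", "/data/run001.nxs")])
    inv = Investigation("Example study", ["A. Scientist"], ["diffraction"], [raw])
    print(list_datafiles(inv))
```

Nesting datafiles inside datasets inside an investigation is what lets a catalogue answer "which files belong to this experiment" with a single traversal.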
Within a Data Policy
Interactions between research process
Grant Proposal
Facilities Proposal
Facilities Experiment
Data cleansing
Record Publication
Data analysis
Local experiments
Simulation
Sample Preparation
Literature Review
Publication
Proposal
Approval
Scheduling
Facilities Experiment
Data storage
Record Publication
Analysis Tools
ORE-CHEM
• An abstract model for planning and enacting chemistry experiments
Earth Sciences: typical workflow
Martin Dove & Erica Yang
• Processing dependent on specialised software
• Sustainability issues
• Context not routinely captured
• Main analysis is reliant on scientist’s knowledge and experience
• selecting parameters and interpreting data
• recorded in a lab note book
• Actual workflow not recorded
• Distributed Data - Little shared infrastructure
• Raw and reduced data stored at ISIS
• Other data on his/her laptop or WebDAV
Interoperability with Publishers
IUCr journal policy - “data” either
• must be supplied in CIF format as an integral part of article submission and are freely available for download or
• must be deposited with the Protein Data Bank before or in concert with article publication; the article will link to the PDB deposition using the PDB reference code
Thanks to Brian McMahon, IUCr
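The either/or rule in the journal policy above can be expressed as a simple check. This is a simplified sketch: `Submission`, its fields, and `satisfies_data_policy` are hypothetical names, and the real policy has conditions (timing, format validation) that are not modelled here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Submission:
    # Hypothetical record of what accompanies an article submission
    cif_supplied: bool = False       # data supplied in CIF format with the article
    pdb_code: Optional[str] = None   # PDB reference code, if deposited


def satisfies_data_policy(submission):
    """True if data comes as CIF with the article or via a PDB deposition."""
    return submission.cif_supplied or bool(submission.pdb_code)


if __name__ == "__main__":
    print(satisfies_data_policy(Submission(cif_supplied=True)))
    print(satisfies_data_policy(Submission()))
```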
Publication flow in IUCr journals
Experiment
Structure solution
IUCr journals
Chemistry databases
Data reduction
Validation
Peer review
editing
Publish
Bibliographic databases
RAW
Research Activity Model
A notion of a research activity – a step in the lifecycle model
– Can define different types of activity.
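One way to picture a research activity is as a typed step with inputs and outputs, so that derived data can be traced back to the raw data it came from. The sketch below is an assumption-laden illustration: `Activity` and `trace_sources` are hypothetical names, and the file names are invented for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Activity:
    kind: str  # type of activity, e.g. "data reduction"
    inputs: List[str] = field(default_factory=list)
    outputs: List[str] = field(default_factory=list)


def trace_sources(item, activities):
    """Return the items `item` derives from that no recorded activity produced."""
    producers = {out: act for act in activities for out in act.outputs}
    sources, stack, seen = [], [item], set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        activity = producers.get(current)
        if activity is None:
            sources.append(current)  # nothing produced it: treat as a source
        else:
            stack.extend(activity.inputs)
    return sources


if __name__ == "__main__":
    steps = [
        Activity("data reduction", ["raw.nxs"], ["reduced.dat"]),
        Activity("structure solution", ["reduced.dat"], ["structure.cif"]),
    ]
    print(trace_sources("structure.cif", steps))
```

Recording each step this way is what would let a derived structure file be linked back to the raw facility data.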
Capabilities
• Good established formats
– Raw data – e.g. NeXus
– Analysed data – e.g. CIF
• Well supported processes for data collection especially at facilities
– ICAT and similar tools as unifying medium
– Simple metadata models for experiments
• Areas needing work
– Upstream planning and synthesis
– Downstream analysed data
– Sharing and integration
• Drivers
– Drive from Facilities (large and small)
– Drive from Publishers