Providing an environment where every data-driven researcher will thrive
Professor Carole Goble, University of Manchester, UK
• Pipelines
– Scientific workflows over (web) services
– Data pipelines, model population and validation, simulation sweeps
– Distributed, federated datasets and analyses combined with local datasets and analysis
– Opening up resources.
• e-Laboratories
– Crowd-sourcing, group curating and sharing/reusing scientific assets.
– Web 2.0 and Semantic Web.
– Social networking, community content, collaborative filtering
– Sharing and exchanging “Research Objects”
– Opening up capabilities and capacity.
• Pan European collaboration.
• Systems Biology of Microorganisms: 13 projects, 91 institutes
– Different research outcomes
– A cross-section of microorganisms, incl. bacteria, archaea and yeast.
• Record and describe the dynamic molecular processes occurring in microorganisms by computerized mathematical models.
– Modellers meet experimentalists
• Pool research capacities, data, models and know-how.
• Retrospectively.
http://www.sysmo.net
BaCell-SysMO COSMIC
SUMO KOSMOBAC SysMO-LAB
PSYSMO Valla
MOSES TRANSLUCENT
STREAM SulfoSYS
+ two more
Data-driven
• Multiple ‘omics
– genomics, transcriptomics
– proteomics, metabolomics
• Images
• Reaction Kinetics
• Models
– SBML, Agent-based, Mechanics based
• Data sets + experiments + models
• Analysis of data
Systems biology workflows in MCISB
• High throughput experimental methods
• Public data sets (e.g. EBI)
• Web Services
• ~1400 NAR January Issue
• Little databases
• Lab books
• Spreadsheets
• Private and Shared.
• Proliferation
• Derived data
• Long tail.
Little Data
My Datasets, My Analytics
Big Data, Group Science, Data services
“Little” Data, “Local” Science
Publish, Access
Massive decentralisation – wikis, sticks, spreadsheets
Massive centralisation – commons, clouds, curated core facilities
Tremendous fragility
Digital Dust in Data Tombs
Picking Pain Points. Keeping it Real.
• Project Directors
– Data remains with us under our control.
– We control who sees what.
– Just enough exchange.
• SysMO PALs
– Spreadsheets.
– Yellow Pages.
– Standard Operating Procedures.
An education
Modellers vs Experimentalists
Computational thinking
Systems thinking
Gray’s Laws (modified)
• Working Now, Working to working
– Gateways and ramps
– Jam today, jam tomorrow
– Just enough, just in time
– Work with what you got already
• 20 questions
– Is there any group generating kinetic data?
– Is this data available?
– Who is working with which organism?
– What methods are being used to determine enzyme activity?
– Under which experimental conditions are my partners working for the measurement of glucose concentration?
Help people search for and find stuff
Data, Services, Processes, Models, Software, Experts
SysMO SEEK Assets Catalogue. Archive. Social Network. Sharing Space. Gateway.
• Yellow Pages
– People. Expertise. Projects. Institutions. Facilities. Studies.
• Data
– Experimental data sets and analysed results.
– Gateway to data stores – SABIO-RK, ‘omics
• Models
– Store. Simulate. Publish. Curate.
– Gateway to COPASI, JWS Online, BioModels.
• Processes
– Laboratory protocols – Standard Operating Procedures
– Bioinformatics analyses – computational workflows – Taverna
– Model population and validation – workflows – Taverna
– Gateway to myExperiment, MolMeth, OpenWetWare….
Interlinking ASSETS CATALOGUE
Linking data to process
Standard Operating Procedures
Models
Software
Provenance
The Lab Book
Retrospective method reconstruction
The myth of reproducible science
• Scientists willing to share methods and protocols.
• SOPs an early win.
• Defined standard metadata model based on Nature Protocols.
• Seeded.
Linking data with stuff
• Research Objects for packaging and exchanging Assets
– Workflows linked to models linked to data linked to SOPs
– Encapsulate community standards
– Mixed resources: External and central.
– Trust
– “Preservation Packet”
– Bechhofer et al 2010 forthcoming in The Future of The Web for Collaborative Science 2010.
• SBRML
– Systems Biology Results Markup Language
– To tie to the SBML
At the coal-face
The Spreadsheet.
The Content Management Systems.
Legacy assets are assets.
Metadata ramps.
The Content Management System
• Lightweight and flexible. Low take-on, hidden operations costs. Knowledgeable Civilians. Looks nice.
• Anarchy amenable.
Spreadsheets
• Template distribution
• Template mapping
SysMOLab
Everyone wants metadata. No one wants to collect it.
Standards mayhem
Metadata millstones
Most data is thrown away.
Metadata for my sake
Metadata compliance by stealth
Preparation for publishing
CIMR Core Information for Metabolomics Reporting
MIABE Minimal Information About a Bioactive Entity
MIACA Minimal Information About a Cellular Assay
MIAME Minimum Information About a Microarray Experiment
MIAME/Env MIAME / Environmental transcriptomic experiment
MIAME/Nutr MIAME / Nutrigenomics
MIAME/Plant MIAME / Plant transcriptomics
MIAME/Tox MIAME / Toxicogenomics
MIAPA Minimum Information About a Phylogenetic Analysis
MIAPAR Minimum Information About a Protein Affinity Reagent
MIAPE Minimum Information About a Proteomics Experiment
MIARE Minimum Information About a RNAi Experiment
MIASE Minimum Information About a Simulation Experiment
MIENS Minimum Information about an ENvironmental Sequence
MIFlowCyt Minimum Information for a Flow Cytometry Experiment
MIGen Minimum Information about a Genotyping Experiment
MIGS Minimum Information about a Genome Sequence
MIMIx Minimum Information about a Molecular Interaction Experiment
MIMPP Minimal Information for Mouse Phenotyping Procedures
MINI Minimum Information about a Neuroscience Investigation
MINIMESS Minimal Metagenome Sequence Analysis Standard
MINSEQE Minimum Information about a high-throughput SeQuencing Experiment
MIPFE Minimal Information for Protein Functional Evaluation
MIQAS Minimal Information for QTLs and Association Studies
MIqPCR Minimum Information about a quantitative Polymerase Chain Reaction experiment
MIRIAM Minimal Information Required In the Annotation of biochemical Models
MISFISHIE Minimum Information Specification For In Situ Hybridization and Immunohistochemistry Experiments
STRENDA Standards for Reporting Enzymology Data
TBC Tox Biology Checklist
BioPAX: Biological Pathways Exchange http://www.biopax.org/
FuGE: Functional Genomics Experiment
MGED: Microarray Experimental Conditions
MIBBI: Minimum Information for Biological and Biomedical Investigations http://www.mibbi.org/index.php/MIBBI_portal
Minimum Information Models
63%47%
Just Enough Results Model
• Harvest standards e.g. MIAME (MIBBI.org)
• Analyse consortium schemas and spreadsheets
• JERMs for each data type – microarray, metabolomics, proteomics ....
• Map project data sources to JERMs.
• Distribute JERM spreadsheet templates
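The mapping and template steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slides do not list the actual JERM fields, so the field names in `JERM_MICROARRAY`, the project column headers in `PROJECT_A_MAPPING`, and the sample rows are all hypothetical.

```python
import csv
import io

# Hypothetical "just enough" field set for a microarray JERM; the real
# JERM fields are not listed in the slides.
JERM_MICROARRAY = ["organism", "strain", "condition", "platform", "data_file"]

# Hypothetical mapping from one project's own column headers to JERM fields.
PROJECT_A_MAPPING = {
    "Species": "organism",
    "Strain ID": "strain",
    "Growth condition": "condition",
    "Array type": "platform",
    "File": "data_file",
}


def to_jerm(rows, mapping, jerm_fields):
    """Rename project columns to JERM fields and report what is missing."""
    records, missing = [], []
    for index, row in enumerate(rows):
        # Columns without a mapping (e.g. free-text notes) are simply dropped.
        record = {mapping[k]: v.strip() for k, v in row.items() if k in mapping}
        absent = [f for f in jerm_fields if not record.get(f)]
        if absent:
            missing.append((index, absent))
        records.append(record)
    return records, missing


def blank_template(jerm_fields):
    """Return a CSV header line to distribute as an empty template."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerow(jerm_fields)
    return buffer.getvalue()


# Made-up example sheet in the project's own layout.
SHEET = """Species,Strain ID,Growth condition,Array type,File,Notes
Bacillus subtilis,168,glucose limitation,custom,run1.txt,ok
Saccharomyces cerevisiae,,anaerobic,custom,run2.txt,
"""

rows = list(csv.DictReader(io.StringIO(SHEET)))
records, missing = to_jerm(rows, PROJECT_A_MAPPING, JERM_MICROARRAY)
print(records[0]["organism"])
print(missing)
print(blank_template(JERM_MICROARRAY), end="")
```

The point of the design is that each project keeps its own spreadsheet layout; only the small mapping table has to be maintained per data source, and the list of missing fields shows where "just enough" has not yet been met.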
“I only want to collect and share just enough results”
JERM Spreadsheet Templates
Controlled vocabulary plug in
• RDF for ripping, mashing and comparing spreadsheets.
• A little semantics goes a long way
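As a rough illustration of why triples make spreadsheets easy to rip, mash and compare, the sketch below turns each cell into a (subject, predicate, object) statement and uses plain set operations on the results. It uses only the Python standard library rather than an RDF toolkit; the `http://example.org/sysmo/` namespace, the column names and the sample data are assumptions made for the example, not part of SysMO.

```python
import csv
import io

BASE = "http://example.org/sysmo/"  # hypothetical namespace


def sheet_to_triples(text, key_column):
    """Turn each non-empty cell into a (subject, predicate, object) triple."""
    triples = set()
    for row in csv.DictReader(io.StringIO(text)):
        subject = BASE + "sample/" + row[key_column].replace(" ", "_")
        for column, value in row.items():
            if column != key_column and value:
                predicate = BASE + "field/" + column.replace(" ", "_")
                triples.add((subject, predicate, value))
    return triples


def to_ntriples(triples):
    """Serialise triples as N-Triples lines with plain string literals."""
    lines = []
    for s, p, o in sorted(triples):
        literal = o.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'<{s}> <{p}> "{literal}" .')
    return "\n".join(lines)


# Two made-up spreadsheets from different labs with the same layout.
LAB_A = "sample,organism,glucose mM\nS1,yeast,20\nS2,yeast,5\n"
LAB_B = "sample,organism,glucose mM\nS2,yeast,5\nS3,E. coli,10\n"

a = sheet_to_triples(LAB_A, "sample")
b = sheet_to_triples(LAB_B, "sample")

shared = a & b      # statements both spreadsheets agree on
merged = a | b      # the mash-up of both spreadsheets
only_in_a = a - b   # statements unique to lab A
print(len(shared), len(merged), len(only_in_a))
print(to_ntriples(shared))
```

Once every cell is an independent statement, merging and comparing no longer depend on row order or column position, which is the "little semantics" doing the work here.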
Reward curation
Local curation at the point of capture – ISA-TAB for ‘omics.
Centralised curation – SBML, CellML, SBO
Automated curation.
Which data is worth curating?
• Blue-Collar Science.
• Curator Credit
• Curator Career
• Funding.
• Personal and institutional visibility
• Scholarly citation metrics
• Federate workloads
• Unpopular with the big data providers.
www.biocurators.org
Commons-based Quality Control.
Progressive Curation: “lazy evaluation” metadata
Just enough, Just in time
Jam today and Jam tomorrow
Gain
Pain
Very BAD
Good, but Unlikely
Just right
Sensitive sharing. Collaborate to compete
Good reasons not to.
Just enough just in time sharing.
Data kept at host.
Registered centrally through harvesting.
Pre-Publication sharing vs Publication
Rewards
Competitive advantage.
Academic vanity.
Adoption.
Reputation.
Risks
Scrutiny.
Being scooped.
Misinterpretation.
Reputation.
Legal issues.
Nature 461, 145 (10 September 2009) | doi:10.1038/461145a
Access Permissions
Just Enough Sharing
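One way to picture "data kept at host, registered centrally" together with owner-controlled access is the minimal sketch below. It is not the SysMO SEEK implementation: the four sharing levels, the `Asset` and `User` classes, and the example URLs are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical sharing levels, ordered from most to least restrictive.
LEVELS = ["private", "project", "consortium", "public"]


@dataclass
class Asset:
    title: str
    owner: str
    project: str
    host_url: str  # the data stays at the host; only this link is registered
    visibility: str = "private"
    shared_with: set = field(default_factory=set)  # named individuals

    def __post_init__(self):
        if self.visibility not in LEVELS:
            raise ValueError(f"unknown visibility: {self.visibility}")


@dataclass
class User:
    name: str
    project: str
    in_consortium: bool = True


def can_view(user, asset):
    """Decide whether a user (or None for anonymous) may see a registry entry."""
    if user is not None and (user.name == asset.owner or user.name in asset.shared_with):
        return True
    if asset.visibility == "public":
        return True
    if user is None:
        return False
    if asset.visibility == "consortium":
        return user.in_consortium
    if asset.visibility == "project":
        return user.project == asset.project
    return False


def harvest(assets, user):
    """Return the registry entries (metadata only) visible to this user."""
    return [(a.title, a.host_url) for a in assets if can_view(user, a)]


# Made-up assets and users.
assets = [
    Asset("Kinetics run 12", "alice", "project_a", "http://host-a.example/12"),
    Asset("Yeast model v3", "bob", "project_b", "http://host-b.example/3",
          visibility="consortium"),
    Asset("Growth SOP", "carol", "project_c", "http://host-c.example/sop",
          visibility="public"),
]
member = User("dave", "project_b")
print(harvest(assets, member))
print(harvest(assets, None))
```

The central registry in this sketch only ever holds a title and a link, so the owner keeps control of the data itself and can widen visibility step by step, for example from project to consortium before publication and to public afterwards.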
Reusing myExperiment
Reward sharing and reusing not reinventing.
Technically. Culturally. Institutionally.
Credit and Risk Mitigation.
Attribution.
Trust.
Credit
Reward and Provenance
Reusing myExperiment
Some pretty key things
• Data citation
• Stable and shared ids and names
– A nightmare.
– Sharednames.org
– Biosharing.org
• Versioning and Provenance
– Models, software, data sets
– Ensembl web service doesn’t report version number.
Data commons, Data havens
For data after the project has ended.
For the common good or me.
Tidy and untidy data.
Beth’s Provenance Objects
Bio2RDF
Access and availability of data and data analysis resources
Web services underpin the ESFRI ELIXIR programme.
Interfaces that are understandable and stable.
Designed for people too.
No access, no tools, no point (Keith Haines)
Deposition to community databanks that minimise pain.
What is it?
Is it working?
Data analysis, model population and data pipelining ramps.
Crossing the adoption chasm
There is a world of complexity for data preparation, processing and analysis
Science Informatics Sweatshops.
E-Laboratories. Workflows. Portals.
Pre-cooked processes and process templates. Pre-cooked interfaces.
Training.
Lymphoma Prediction Workflow
Microarray from tumor tissue
Microarray preprocessing
Lymphoma prediction
caArray, GenePattern
Use gene-expression patterns associated with two lymphoma types to predict the type of an unknown sample.
Wei Tan, Univ. Chicago
Ack. Juli Klemm, Xiaopeng Bian, Rashmi Srinivasa (NCI), Jared Nedzel (MIT)
myExperiment Communities
• Supermarket shoppers
• Tool builders
• Trainers and Trainees
Drop and Compute
Local folder synchronised and shared via cloud
Condor job submitted by drag and drop
Results appear in Dropbox
Ian Cottam
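The slide gives no implementation details, so the following is only a guess at the shape of such a service: a script notices new files in the synchronised folder and writes an HTCondor submit description for each, with output directed to a results folder. The script name `analyse.sh`, the folder names and the helper functions are hypothetical, and the sketch stops short of calling `condor_submit`.

```python
from pathlib import Path
import tempfile


def new_drops(folder, seen):
    """List files in the shared folder that have not been handled yet."""
    return sorted(p for p in Path(folder).iterdir()
                  if p.is_file() and p.name not in seen)


def submit_description(executable, input_file, results_dir):
    """Build the text of an HTCondor submit description for one input file."""
    input_path = Path(input_file)
    results = Path(results_dir)
    stem = input_path.stem
    return "\n".join([
        f"executable = {executable}",
        f"arguments = {input_path.name}",
        f"transfer_input_files = {input_path.as_posix()}",
        "should_transfer_files = YES",
        "when_to_transfer_output = ON_EXIT",
        f"output = {(results / (stem + '.out')).as_posix()}",
        f"error = {(results / (stem + '.err')).as_posix()}",
        f"log = {(results / (stem + '.log')).as_posix()}",
        "queue",
    ]) + "\n"


# Demonstration with a temporary folder standing in for the synchronised one.
with tempfile.TemporaryDirectory() as tmp:
    drop, results = Path(tmp) / "drop", Path(tmp) / "results"
    drop.mkdir()
    results.mkdir()
    (drop / "sample1.csv").write_text("a,b\n1,2\n")
    seen = set()
    pending = new_drops(drop, seen)
    description = submit_description("analyse.sh", pending[0], results)
    seen.update(p.name for p in pending)   # remember what was submitted
    remaining = new_drops(drop, seen)      # nothing left to submit

print(description)
```

In a real deployment the generated text would be written to a file and handed to `condor_submit`, and the results folder would itself sit inside the synchronised area so that outputs reappear on the user's machine.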
Bashing against local IT
NO – you can’t access that datastore / run your analysis.
Joined up thinking.
Data + Publications
Data trapped in documents
Supplemental information
Text mining
Text mining workflows
Text mining to find method and controls
Reflect. Elsevier Challenge Winner 2009
Manual and Auto-mark up [Oscar-3]
Do not underestimate the power of Interactive Visualisation and Browsing
Pre-cooked complex queries.
Navigation.
With my data.
At the click of a button.
• Distributed Annotation Service
• Upload and overlay my data
SysMO summary
• Providing an environment where every data-driven researcher will thrive
• Reality is messy.
– Extreme Technology Determinism vs Voluntarist Sociocultural shaping
• Extreme and continuous partnership with users.
– Act Local Think Global
• Agile development environment facilitated stream of features to tackle pain points.
– Leverage other e-Laboratories, Maintaining scientists’ buy-in.
• Socio-Political Axis dominates the Technical Axis.
– Collaboration evolutions, Confidence in exchange.
Coordination
Sustainability
Interoperability
Adoption
Capacity
Data
Six Action Plan Areas
Capacity building of our skills base
• Influence training and capacity building programmes.
• Promote training for young and mid-career researchers and research technologists.
• Enable mixed skilled research teams to include research and information technologists.
• Value and reward highly skilled research and information technologists within HE institutions with a career structure.
Data Silo culture
Funding silos
Discipline silos
Academic Credit and Risk Mitigation for sharing, curating, and reusing not reinventing
Data and Software are free like puppies are free
University of Stellenbosch, South Africa; University of Manchester, UK
Jacky Snoep
EML Research gGmbH, Germany
Isabel Rojas
University of Manchester, UK
Olga Krebs
Wolfgang Müller
Sergejs Aleksejevs
Carole Goble
Stuart Owen
Katy Wolstencroft
Finn Bacal
Links
• myGrid Project – http://www.mygrid.org.uk
• SysMO-DB – http://www.sysmo-db.org
• myExperiment – http://www.myexperiment.org
• Taverna – http://www.taverna.org.uk
• JWS Online – http://jjj.biochem.sun.ac.za/
• SABIO-RK – http://sabio.villa-bosch.de/