Date post: | 27-Jan-2015 |
Upload: | alejandra-gonzalez-beltran |
The Investigation/Study/Assay (ISA) metadata framework for reproducible and reusable bioscience research
Alejandra González-Beltrán, PhD, on behalf of the ISA Team
Oxford e-Research Centre, University of Oxford
Faculty of Technology, Environment and Engineering Birmingham City University
12th March 2013
Ioannidis et al., Repeatability of published microarray gene expression analyses. Nature Genetics 41(2), 149-55 (2009) doi:10.1038/ng.295
http://www.nature.com/news/2011/110111/full/469139a.html
http://www.economist.com/node/21528593
http://www.nytimes.com/2011/07/08/health/research/08genes.html
Contextual information (metadata):
• Sample characteristics
• Technology and measurement types
• Instrument parameters
• …
Need for a generic representation, applied to:
• microarray-based experiments (MAGE)
• sequencing-based experiments (SRA)
• flow cytometry-based experiments (FuGE-Flow Cyt)
• mass spectrometry and NMR spectroscopy experiments (MetaboLights and PRIDE)
Roadmap
Reproducible & Reusable Bioscience Research
Well-‐annotated & Structured Data
reasoning
analysis
exchange
integration
visualization
browsing retrieval
User community
Community Standards, Software Tools
Source of the figure: EBI website
§ Interdisciplinary and integrative in character
• need to deal with new and existing datasets
• deal with a variety of data types
Bioscience is multi-domain…
tox/pharma
env
health
agro
Multiple communities, multiple norms and standards, e.g.:
• report the same core, essential information
• use the same term to refer to the same ‘thing’
• allow data to flow from one system to another
Challenges: lack of interaction and coordination, duplication of effort, fragmentation and uneven coverage… all of which hinder interoperability
130 +
Estimated
150 +
Source: MIBBI, EQUATOR
303 +
Source: BioPortal
Databases, annotation, curation tools
MIAME, MIAPA, MIRIAM, MIQAS, MIX, MIGEN, CIMR, MIAPE, MIASE, MIQE, MISFISHIE…, REMARK, CONSORT
MAGE-Tab, GCDML, SRAxml, SOFT, FASTA, DICOM, MzML, SBRML, SEDML…, GELML, ISA-Tab, CML, MITAB
AAO, CHEBI, OBI, PATO, ENVO, MOD, BTO, IDO…, TEDDY, PRO, XAO, DO, VO, GIATE
Growing number of bioscience reporting standards
But… what do we know about them and how they are related
Which ones are mature enough for me to use or recommend?
I work on plants, are these standards just for biomedical applications?
What are the criteria to evaluate their status and value?
How can I get involved to propose extensions or modifications?
Which tools and databases implement which standards?
I use high throughput sequencing technologies, which ones are relevant to me?
Which formats support specific minimum information guidelines?
A coherent, curated and searchable catalogue of data sharing resources
• Bioscience standards and associated data-sharing policies, publications, tools and databases
• Assessment criteria for usability and popularity of standards
• Relationships among standards
• Encouragement for communication & interaction among groups
• Promoting interoperability & informed decisions about standards
• Assist in the annotation and management of experimental metadata at source, supporting data provenance tracking
• Deal with high-throughput studies using one or a combination of omics and other technologies
• Empower users to take up community-defined checklists and ontologies
• Facilitate data sharing, re-use, comparison and reproducibility of experiments, and submission to international public repositories
infrastructure
ISA software suite: supporting standards-compliant experimental annotation and enabling curation at the community level. Rocca-Serra et al., 2010, Bioinformatics
faahKO dataset
• Available in Bioconductor
• Subset of the original data on global metabolite profiling
• LC/MS peaks from the spinal cords of 6 wild-type and 6 FAAH (fatty acid amide hydrolase) knockout mice
Saghatelian et al., Biochemistry, 2004
faahKO investigation
- Define key entities (e.g. factors, protocols, parameters)
- Grouping of studies
- Relate studies and assays
faahKO study
- Subjects studied: source(s), sampling methodology, characteristics
- Treatments/manipulations performed to prepare the specimens
NEWT UniProt Taxonomy Database
Mouse Genome Informatics
Mouse Adult Gross Anatomy
Create template(s) to fit the type of experiments to be described
Create templates detailing the steps to be reported for different investigations, complying with community standards, e.g. configuring the value(s) allowed for each field to be:
• text (with/without regular expression testing),
• ontology terms,
• numbers etc.
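The per-field configuration described above can be pictured as a small set of validation rules. The following is a minimal sketch in Python, not the actual ISA configuration format: the `FieldRule` class, the example field rules and the allowed-term set are all hypothetical and only illustrate text (with an optional regular expression), ontology-term and numeric fields.

```python
import re
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class FieldRule:
    """Hypothetical rule for one template field (not the ISA configuration schema)."""
    kind: str                                  # "text", "ontology" or "number"
    pattern: Optional[str] = None              # optional regular expression for text fields
    allowed_terms: Optional[Set[str]] = None   # accepted labels for ontology fields

    def accepts(self, value: str) -> bool:
        if self.kind == "text":
            return self.pattern is None or re.fullmatch(self.pattern, value) is not None
        if self.kind == "ontology":
            return self.allowed_terms is not None and value in self.allowed_terms
        if self.kind == "number":
            try:
                float(value)
            except ValueError:
                return False
            return True
        raise ValueError(f"unknown field kind: {self.kind}")


# Illustrative template: header names follow ISA-Tab style, the rules are made up.
TEMPLATE = {
    "Sample Name": FieldRule("text", pattern=r"[A-Za-z0-9_]+"),
    "Characteristics[organism]": FieldRule(
        "ontology", allowed_terms={"Homo sapiens", "Mus musculus"}
    ),
    "Factor Value[dose]": FieldRule("number"),
}


def validate_row(row: dict) -> list:
    """Return the names of the fields whose values break their rule."""
    return [name for name, rule in TEMPLATE.items() if not rule.accepts(row.get(name, ""))]
```

Calling `validate_row` on a dictionary of field values returns an empty list when every field complies, which is the kind of check a template-driven editor could run while the user types.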
Describe, curate your experiment using a desktop-based tool
Report and edit the description using this tool (also customized using the templates), with a spreadsheet-like look and feel, packed with functionalities such as:
• ontology search (access via )
• term-tagging features
• import from spreadsheets etc…
• Ontology search and automated tagging (relying on NCBO BioPortal services) on Google Spreadsheets
• Collaborative annotation; support for distributed users
• Version control & history
OntoMaton: a BioPortal powered ontology widget for Google Spreadsheets. Maguire et al., 2013, Bioinformatics
• R package available in Bioconductor 2.11: http://bioconductor.org/packages/release/bioc/html/Risa.html
• ISAtab class
• Read ISAtab files into ISAtab objects and write ISAtab files back to disk
• Increment metadata with definition of factors/treatments/groups
• Build xcmsSet (xcms package) objects from mass spectrometry assays
• Augment the ISAtab dataset after analysis
• Source & issues tracking: https://github.com/ISA-tools/Risa
• faahKO package v. 2.12 contains ISAtab files describing the experiment
    library(Risa)
    faahkoISA = readISAtab(find.package("faahKO"))          # read the ISA-Tab files shipped with faahKO
    assay.filename <- faahkoISA["assay.filenames"][[1]]     # first assay file
    xset = processAssayXcmsSet(faahkoISA, assay.filename)   # build an xcmsSet from the MS assay
    …
    updateAssayMetadata(faahkoISA, assay.filename, "Derived Spectral Data File", "faahkoDSDF.txt")   # record the derived data file
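For readers outside R, the "read ISAtab files" step can be approximated with a plain tab-delimited reader. This is a minimal sketch in Python and is not part of Risa: the function names `read_isatab_table` and `column_values` are hypothetical, and the sample content is illustrative rather than taken from faahKO. Rows are kept positional because ISA-Tab headers such as "Term Source REF" may occur more than once in a table.

```python
import csv
import io


def read_isatab_table(handle):
    """Read a tab-delimited ISA-Tab study or assay table.

    Returns (header, rows). Rows stay as lists because header names
    such as "Term Source REF" can be repeated within one table.
    """
    reader = csv.reader(handle, delimiter="\t", quotechar='"')
    header = next(reader)
    # Skip rows that are completely empty.
    rows = [row for row in reader if any(cell.strip() for cell in row)]
    return header, rows


def column_values(header, rows, name):
    """Collect the values of every column called `name` (there may be several)."""
    indexes = [i for i, h in enumerate(header) if h == name]
    return [[row[i] for row in rows] for i in indexes]


# Tiny in-memory example; the content is illustrative only.
SAMPLE = (
    '"Sample Name"\t"Material Type"\t"Array Data File"\n'
    '"sample1"\t"genomic DNA"\t"assay1.cel"\n'
    '"sample2"\t"genomic DNA"\t"assay2.cel"\n'
)

header, rows = read_isatab_table(io.StringIO(SAMPLE))
data_files = column_values(header, rows, "Array Data File")[0]
```

The same function can be pointed at a file opened with `open(path, newline="")`; the sketch assumes every row has as many cells as the header.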
• MTBLS2 processing and analysis using Risa, xcms and CAMERA BioConductor packages
Metabolights – an open access general-purpose repository for metabolomics studies and associated meta-data Haug et al, 2012 Nucleic Acids Research
[Figure: ISA-Tab building blocks. Material Node: Characteristics[…], Factor Value[…] (independent variables), Material Type, Comment[…]. Protocol/Process: Parameter Value[…], Date (day effect), Performer (operator effect). Data File Node: Raw Data File, Derived Data File.]
Sample Name | Material Type | Hybridization Assay Name | Array Design REF | Array Data File | Protocol REF | Derived Array Data File
sample1 | genomic DNA | assay1 | A-AFFY-107 | assay1.cel | data normalization | assay1.txt
sample2 | genomic DNA | assay2 | A-AFFY-107 | assay2.cel | data normalization | assay2.txt
sample3 | genomic DNA | assay3 | A-AFFY-107 | assay3.cel | data normalization | assay3.txt
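Each row of such a table can be read as a path from material to data, with Protocol REF columns naming the process applied between two nodes. The sketch below, in Python, shows one way to turn a row into graph edges. It is an illustration under stated assumptions: the function `row_to_edges` is hypothetical, and the three header sets list only a subset of the node columns defined by ISA-Tab.

```python
# Subsets of ISA-Tab node headers, listed here for illustration only.
MATERIAL_NODES = {"Source Name", "Sample Name", "Extract Name", "Labeled Extract Name"}
ASSAY_NODES = {"Assay Name", "Hybridization Assay Name"}
DATA_NODES = {"Raw Data File", "Array Data File", "Derived Data File", "Derived Array Data File"}


def row_to_edges(header, row):
    """Turn one table row into (source_node, protocol, target_node) edges.

    Nodes are (kind, value) pairs. Attribute columns such as Material Type
    or Array Design REF are skipped. `protocol` is None when no Protocol REF
    column sits between two consecutive nodes.
    """
    edges = []
    previous = None
    protocol = None
    for name, value in zip(header, row):
        if name == "Protocol REF":
            protocol = value
        elif name in MATERIAL_NODES or name in ASSAY_NODES or name in DATA_NODES:
            if name in MATERIAL_NODES:
                kind = "material"
            elif name in DATA_NODES:
                kind = "data"
            else:
                kind = "assay"
            node = (kind, value)
            if previous is not None:
                edges.append((previous, protocol, node))
            previous, protocol = node, None
    return edges


HEADER = ["Sample Name", "Material Type", "Hybridization Assay Name", "Array Design REF",
          "Array Data File", "Protocol REF", "Derived Array Data File"]
ROW = ["sample1", "genomic DNA", "assay1", "A-AFFY-107",
       "assay1.cel", "data normalization", "assay1.txt"]
EDGES = row_to_edges(HEADER, ROW)
```

For the first row of the table, this yields three edges: sample to assay, assay to raw array data file, and raw file to derived file via the "data normalization" protocol.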
Material transformations...
Tagging: from free text to ontology-based
• single intervention representation, free text annotation
Source Name | Characteristics[organism] | Factor Value[perturbation agent] | Factor Value[dose] | Factor Value[duration]
individual1 | human | aspirin | high dose | 12 weeks
• single intervention, ontology-based annotation
Source Name | Characteristics[organism (obi:0100026)] | Term Source REF | Term Accession Number | Factor Value[chemical compound (CHEBI_37577)] | Term Source REF | Term Accession Number
individual1 | Homo sapiens | NCBITax | 9606 | aspirin | CHEBI | 1231354
Factor Value[dose (OBI_0000984)] | Term Source REF | Term Accession Number | Factor Value[time (PATO_0000165)] | Unit | Term Source REF | Term Accession Number
low dose | LNC | LP30872-3 | 12 | week | UO | 0000034
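Moving from free text to ontology-based annotation amounts to replacing a single value with three columns: the term label, its Term Source REF and its Term Accession Number. A minimal Python sketch of that expansion follows. The `LOOKUP` table and both function names are hypothetical; in practice the lookup would be an ontology service query rather than a hard-coded dictionary.

```python
# Hypothetical lookup table; a real workflow would query an ontology service instead.
LOOKUP = {
    "human": ("Homo sapiens", "NCBITax", "9606"),
    "mouse": ("Mus musculus", "NCBITax", "10090"),
}


def tag_value(free_text):
    """Map a free-text value to (label, Term Source REF, Term Accession Number).

    Unknown values are kept unchanged with empty source and accession.
    """
    key = free_text.strip().lower()
    return LOOKUP.get(key, (free_text, "", ""))


def tag_column(header, rows, column):
    """Expand one column into the three ontology-annotation columns."""
    i = header.index(column)
    new_header = header[:i + 1] + ["Term Source REF", "Term Accession Number"] + header[i + 1:]
    new_rows = [row[:i] + list(tag_value(row[i])) + row[i + 1:] for row in rows]
    return new_header, new_rows


FREE_HEADER = ["Source Name", "Characteristics[organism]", "Factor Value[dose]"]
FREE_ROWS = [["individual1", "human", "high dose"]]
TAGGED_HEADER, TAGGED_ROWS = tag_column(FREE_HEADER, FREE_ROWS, "Characteristics[organism]")
```

Keeping unknown values untouched, rather than failing, mirrors the idea that tagging is incremental: a curator can fill in the remaining terms later.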
Kohonen et al. The ToxBank Data Warehouse: a research cluster of 7 EU FP7 Health systems toxicology and toxicogenomics projects.
Health Care & Life Sciences Interest Group
ToxBank effort developed by Nina Jeliazkova
• Make the semantics of ISAtab explicit, including materials & data entities & processes & their relationships
• Provide incentives for provision of ontology-based annotations in ISA-TAB datasets; exploit those annotations
• Augment ISA syntax with new elements (e.g. groups), facilitating the understanding & querying of experimental design
• Facilitate data integration & knowledge discovery/reasoning
architecture
ISA-TAB parser, isa2owl mapping parser, graph analysis, configuration file
Implementation:
- Java-based
- using OWL API
Experimental domain
Biomolecular domain
Chemical domain
Information domain
vocabularies
Open Biological and Biomedical Ontologies (OBO) Foundry: OBI, GO, ChEBI, IAO, BFO
faahKO dataset
• Available in Bioconductor (with ISA-TAB metadata)
• Global metabolite profiling
• Data subset: LC/MS peaks from the spinal cords of 6 wild-type and 6 FAAH (fatty acid amide hydrolase) knockout mice
• support different conversion modes (different levels of granularity)
• querying for ISA-TAB datasets, across multiple experiment types
• reasoning exploiting ontology annotations
  – semantic validation of ISA-TAB datasets
• augmented annotation over native ISA syntax
  – identification of gaps in ontological representations
  – feedback of findings to community ontologies
Increasing level of structure for experimental metadata
Notes in Lab books
Spreadsheets & Tables (ISAtab metadata)
Facts as RDF statements
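The last step of this progression, facts as RDF statements, can be sketched by serializing one annotated source as N-Triples. The Python example below is only an illustration and is not the isa2owl mapping: the `http://example.org/isa/` namespace and the `hasOrganism` predicate are hypothetical, while the OBO PURL prefix and `rdfs:label` are standard.

```python
EX = "http://example.org/isa/"             # hypothetical namespace, for illustration only
OBO = "http://purl.obolibrary.org/obo/"    # OBO Foundry PURL prefix
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


def iri(value):
    """Wrap an IRI in angle brackets, as N-Triples requires."""
    return f"<{value}>"


def literal(value):
    """Render a plain string literal, escaping backslashes and double quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def source_to_ntriples(source_name, organism_label, taxon_id):
    """Emit N-Triples statements for one source annotated with an NCBI Taxonomy term."""
    subject = iri(EX + source_name)
    organism = iri(f"{OBO}NCBITaxon_{taxon_id}")
    return [
        f"{subject} {iri(EX + 'hasOrganism')} {organism} .",
        f"{organism} {iri(RDFS_LABEL)} {literal(organism_label)} .",
    ]


TRIPLES = source_to_ntriples("individual1", "Homo sapiens", "9606")
```

Once metadata is expressed this way, the organism is an identifier that can be joined across datasets, rather than a string that has to be matched by spelling.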
A growing ecosystem of over 30 public and internal resources using the ISA metadata tracking framework to facilitate standards-compliant collection, curation, management and reuse of investigations in an increasingly diverse set of life science domains.
Towards interoperable bioscience data. Sansone et al., 2012, Nature Genetics