Date posted: 20-Mar-2017
Category: Data & Analytics
Uploaded by: daniel-jacob
Give open access to your data and make it ready to be mined
Daniel Jacob
UMR 1332 BFP – Metabolism Group
Bordeaux Metabolomics Facility
May 2016
Open Data for Access and Mining
A data explorer as a bonus
EDTMS
ODAM
Daniel Jacob – INRA UMR 1332 – May 2016
The experimental context: needs / wishes
[Diagram: sample identifiers carried through seeding, harvesting, samples preparation and samples analysis]
• Identifiers centrally managed
• Data sharing & data availability
• Facilitate the subsequent data mining
• Avoid the tedious implementation of a data management system involving a data model (RDBMS)
Make both metadata and data available for data mining
Open Data for Access and Mining: the core idea in one shot
Implementation of an Experimental Data Tables Management System (EDTMS)
[Diagram: experimental data tables are deposited in a data repository mounted on myhost.org (PUT: data capture with minimal effort) and retrieved from http://myhost.org/ (GET)]
• Merely dropping data files into a data repository (e.g. a local NAS or remote storage space) should allow users to access them through web services.
• Data can be downloaded, explored and mined.
• No database schema, no programming code and no additional configuration on the server side.
Preparation and cleaning of the data subset files
Data subset files: plants.tsv, harvests.tsv, samples.tsv, compounds.tsv, enzymes.tsv
• Whatever the kind of experiment, this approach assumes a design of experiment (DoE) involving individuals, samples or other things as the main objects of study (e.g. plants, tissues, bacteria, …).
• It also assumes the observation of dependent variables resulting from the effects of some controlled experimental factors.
• Moreover, each object of study usually has an identifier, and the variables can be quantitative or qualitative.
• There can be a single type of object of study or several, but in the latter case there must be a relationship between the object types, which we assume to be of the "obtainedFrom" type.
Classification of each column within its right category
Data subset files: plants.tsv, harvests.tsv, samples.tsv, compounds.tsv, enzymes.tsv
Categories: identifier, link, factor, quantitative, qualitative
Data subset files and their associated metadata files must be compliant with the TSV standard (Tab-Separated Values).
• You have to organize your data subsets so that links can be established between them.
• In practice, this means adding a column containing the identifiers of the entity to which you want to connect the subset, implying an 'obtainedFrom' relation.
• Note that this duplication of identifiers must be the only redundant information across all data subsets.
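As a rough illustration of why the duplicated identifier column is enough to link two subsets, here is a minimal Python sketch. The column names PlanteID and Lot follow the slides; the Treatment and Stage columns, all values, and the helper functions are invented for the example and are not part of ODAM.

```python
import csv
import io

# Invented miniature subsets: only PlanteID and Lot come from the slides.
PLANTS_TSV = "PlanteID\tTreatment\nP1\tControl\nP2\tWaterStress\n"
HARVESTS_TSV = "Lot\tPlanteID\tStage\nL1\tP1\tgreen\nL2\tP2\tgreen\nL3\tP1\tred\n"


def read_tsv(text):
    """Parse TSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def join_on_link(child_rows, parent_rows, link_column):
    """Attach the parent's attributes to each child row via the shared identifier."""
    parents = {row[link_column]: row for row in parent_rows}
    merged = []
    for child in child_rows:
        parent = parents[child[link_column]]  # a KeyError here means a dangling link
        merged.append({**parent, **child})
    return merged


harvests_with_plants = join_on_link(
    read_tsv(HARVESTS_TSV), read_tsv(PLANTS_TSV), "PlanteID"
)
for row in harvests_with_plants:
    print(row["Lot"], row["PlanteID"], row["Treatment"], row["Stage"])
```

Because the link column is the only repeated information, every other attribute of a plant is stored once and recovered through the join.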
Connections between the dataset files based on identifiers
Data subset files and their entities (concepts): plants.tsv (Plants), harvests.tsv (Harvests), samples.tsv (Samples), compounds.tsv (Compounds), enzymes.tsv (Enzymes)
[Diagram legend]
• Identifier of the central entity of the subset
• Link between 2 subsets, carried out through identifiers (implies an 'obtainedFrom' relation)
Categories: identifier, link, factor, quantitative, qualitative
Creation of the metadata files
Supplementary files
To allow data to be explored and mined, we have to add some minimal but relevant metadata. For that, 2 metadata files are required:
• s_subsets.tsv: a file that associates each data subset with a key concept corresponding to the main entity of the subset, and that records the "obtainedFrom" relations between these concepts
• a_attributes.tsv: a metadata file that annotates each attribute (concept/variable) with some minimal but relevant metadata
Data subset files and their associated metadata files must be compliant with the TSV standard (Tab-Separated Values).
Note: TSV is an alternative to the common comma-separated values (CSV) format, which often causes difficulties because of the need to escape commas.
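To make the note about commas concrete, the following small Python sketch writes the same row once as CSV and once as TSV with the standard csv module. The row content is invented for illustration.

```python
import csv
import io

# Invented example row: the description field contains commas.
header = ["SampleID", "Description"]
row = ["S1", "pericarp, frozen, ground"]


def serialize(rows, delimiter):
    """Write rows with the csv module using the given delimiter."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


as_csv = serialize([header, row], ",")
as_tsv = serialize([header, row], "\t")

print(as_csv)  # the comma-containing field has to be quoted
print(as_tsv)  # the same field is written unchanged
```

With a tab delimiter, a naive split on tabs recovers the original fields, which is what makes TSV convenient for spreadsheet-style data.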
Creation of the metadata files
s_subsets.tsv
This metadata file associates a key concept with each data subset file. Each row describes one subset:
• Unique rank number of the data subset
• Key concept (i.e. the main entity) associated with the subset, in the form of a short name
• Identifier of the central entity of the subset
• Link between 2 subsets (implies an 'obtainedFrom' relation)
• Data file name
Example:
Rank   Concept     Identifier   Data file name
1      Plants      PlanteID     plants.tsv
2      Harvests    Lot          harvests.tsv
3      Samples     SampleID     samples.tsv
4      Compounds   SampleID     compounds.tsv
5      Enzymes     SampleID     enzymes.tsv
Categories: identifier, factor, quantitative, qualitative
Creation of the metadata files
a_attributes.tsv
This metadata file annotates each attribute (variable) with some minimal but relevant metadata.
Categories: identifier, factor, quantitative, qualitative
Subsets shown: Plants, Harvests, Samples, Compounds, …
Updating the metadata files
Additional subsets and attributes can be added to s_subsets.tsv and a_attributes.tsv step by step, as soon as data are produced.
Uploading your datasets to the data repository
• Merely dropping data files into the data repository (e.g. a NAS) should allow users to access them through web services.
• No database schema, no programming code and no additional configuration on the server side.
• Data subset files and their associated metadata files must be compliant with the TSV standard (Tab-Separated Values).
[Screenshot: your data subset files placed in your dataset entry (named 'frim1' as an example) within the data repository, mounted as Z: (\\Storage)]
[Diagram: data capture with minimal effort (PUT) into the data repository mounted on myhost.org; data retrieval (GET)]
Checking online whether your data subset files are consistent
http://myhost.org/check/frim1
[Diagram: myhost.org serving the data repository \\Storage\DataRepos hosted on the NAS]
Many checks can be done automatically for you.
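The slides do not list which checks the online service runs. Purely as an illustration of the kind of consistency test that could be automated, the Python sketch below verifies that identifiers are present and unique and that every link value exists in the parent subset. The function names and the sample data are hypothetical and do not describe the actual check service.

```python
import csv
import io


def read_tsv(text):
    """Parse TSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def check_subset(rows, identifier, link=None, parent_ids=None):
    """Return a list of human-readable problems found in one data subset."""
    problems = []
    seen = set()
    for line_no, row in enumerate(rows, start=2):  # line 1 is the header
        value = row.get(identifier)
        if not value:
            problems.append(f"line {line_no}: missing {identifier}")
        elif value in seen:
            problems.append(f"line {line_no}: duplicate {identifier} {value}")
        else:
            seen.add(value)
        if link is not None and row.get(link) not in parent_ids:
            problems.append(f"line {line_no}: unknown {link} {row.get(link)}")
    return problems


# Invented data: one duplicated lot and one link to a plant that does not exist.
plants = read_tsv("PlanteID\nP1\nP2\n")
harvests = read_tsv("Lot\tPlanteID\nL1\tP1\nL1\tP2\nL3\tP9\n")

plant_ids = {row["PlanteID"] for row in plants}
problems = check_subset(harvests, "Lot", link="PlanteID", parent_ids=plant_ids)
for problem in problems:
    print(problem)
```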
Open Data, Access and Mining: web-services
[Diagram: seeding, harvesting, samples preparation and samples analysis feed the data storage (e.g. FRIM1 (*)) after preparation and cleaning of the data subset files, TSV formatting, data linking and checking: minimal effort (PUT). Web services then give access to the data: maximal efficiency (GET).]
After depositing your complete dataset as described previously:
• Open access is given to your data through web services
• They are ready to be mined
• No specific code or additional configuration is needed
(*) https://www.erasysbio.net/index.php?index=266
Open Data, Access and Mining: web-services
REST services: hierarchical tree of resource naming (URL)
GET http://myhost.org/getdata/<data format>/<dataset name>/< … >/< … >
• <data format>: xml, tsv or json
• <dataset name>: e.g. frim1 for FRIM1 (*)
• then <subset> or (<subset>)
• then, to retrieve data: <entry> followed by <value>
• or, to retrieve metadata: <category>
Categories: identifier, link, factor, quantitative, qualitative
(*) https://doi.org/10.5281/zenodo.154041
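The URL pattern above can be assembled mechanically. The Python sketch below is one possible way to do so; the function name getdata_url and its merged flag (which wraps the subset name in parentheses, the form used elsewhere in this deck for merged subsets) are assumptions made for this example, not part of ODAM or Rodam.

```python
from urllib.parse import quote


def getdata_url(host, data_format, dataset, *path, merged=False):
    """Build a getdata URL following the resource tree shown on the slide.

    path holds the remaining segments: the subset, then entry/value or category.
    merged=True wraps the subset name in parentheses.
    """
    if data_format not in ("xml", "tsv", "json"):
        raise ValueError(f"unsupported data format: {data_format}")
    segments = [str(p) for p in path]
    if segments and merged:
        segments[0] = f"({segments[0]})"
    encoded = [quote(s, safe="()") for s in segments]
    return "/".join([f"http://{host}/getdata", data_format, dataset, *encoded])


print(getdata_url("myhost.org", "xml", "frim1"))
print(getdata_url("myhost.org", "xml", "frim1", "harvests", "lot", 1))
print(getdata_url("myhost.org", "xml", "frim1", "compounds", "quantitative", merged=True))
```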
Open Data Access via web-services: examples based on FRIM1
• http://myhost.org/getdata/xml/frim1
• http://myhost.org/getdata/xml/frim1/plants
• http://myhost.org/getdata/xml/frim1/harvests/lot/1
• http://myhost.org/getdata/xml/frim1/(compounds)/quantitative
[Screenshots: two of the responses are labelled "Metadata" and two are labelled "Data"]
Open Data Access via web-services: examples based on FRIM1
http://myhost.org/getdata/xml/frim1/(samples)/treatment/Control
A subset name in parentheses denotes the set of data obtained by merging the specified subset with all the subsets of lower rank, following the pathway defined by the "obtainedFrom" links.
(samples) = plants + harvests + samples
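The merge rule for a parenthesised subset can be pictured as walking the "obtainedFrom" links back to the root. The Python sketch below shows that walk under stated assumptions: the SUBSETS dictionary is a hypothetical in-memory stand-in for s_subsets.tsv, and the parents given for compounds and enzymes are guesses made for the example.

```python
# Hypothetical description of the subsets: rank and 'obtainedFrom' parent.
SUBSETS = {
    "plants": {"rank": 1, "obtained_from": None},
    "harvests": {"rank": 2, "obtained_from": "plants"},
    "samples": {"rank": 3, "obtained_from": "harvests"},
    "compounds": {"rank": 4, "obtained_from": "samples"},
    "enzymes": {"rank": 5, "obtained_from": "samples"},
}


def merge_chain(subset, subsets=SUBSETS):
    """List the subsets merged for '(subset)', from the root down to subset."""
    chain = []
    current = subset
    while current is not None:
        chain.append(current)
        parent = subsets[current]["obtained_from"]
        if parent is not None and subsets[parent]["rank"] >= subsets[current]["rank"]:
            raise ValueError(f"{parent} must have a lower rank than {current}")
        current = parent
    return list(reversed(chain))


print(" + ".join(merge_chain("samples")))  # plants + harvests + samples
```

Note that the chain follows the links rather than the ranks alone, so (enzymes) in this sketch would not include compounds even though compounds has a lower rank.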
Open Data Access via web-services: application layer
[Diagram: data, TSV format and data linking (e.g. FRIM1) exposed through web services: minimal effort, maximal efficiency]
Use existing tools: spreadsheets, RStudio, BioStatFlow, Galaxy, Cytoscape, …
Open Data Access via web-services: application layer
Retrieving data within R: the R package Rodam
Open Data Access via web-services: Rodam package
GET http://www.bordeaux.inra.fr/pmb/getdata/tsv/frim1/(samples)/sample/365
• <data format>: tsv
• <dataset name>: frim1
• (<subset>): samples
• <entry>: sample
• <value>: 365
Open Data Access via web-services: Rodam package
Get the data subset 'activome' along with its metadata
Read metadata, i.e. category types within the data:
GET http://www.bordeaux.inra.fr/pmb/getdata/tsv/frim1/(activome)/factor
• <data format>: tsv
• <dataset name>: frim1
• (<subset>): activome
• <category>: factor
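The slides use the R package Rodam for these requests. As a language-neutral illustration of what a client has to do with a TSV response, here is a minimal Python sketch using only the standard library. It is not Rodam's implementation: the function names are hypothetical, UTF-8 encoding is an assumption, and the payload used in the demonstration is invented, so a real response will have different columns.

```python
import csv
import io
from urllib.request import urlopen


def parse_tsv(text):
    """Turn a TSV payload into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def fetch_tsv(url, timeout=30):
    """GET a TSV resource and parse it (requires network access)."""
    with urlopen(url, timeout=timeout) as response:
        # Assumption: the service answers in UTF-8.
        return parse_tsv(response.read().decode("utf-8"))


# Offline demonstration with an invented payload.
sample_payload = "SampleID\tTreatment\tGlucose\n365\tControl\t1.25\n"
rows = parse_tsv(sample_payload)
print(rows[0]["SampleID"], rows[0]["Treatment"], float(rows[0]["Glucose"]))
```

A call such as fetch_tsv("http://www.bordeaux.inra.fr/pmb/getdata/tsv/frim1/(samples)/sample/365") would apply the same parsing to the URL shown on the slides, provided the service is still reachable.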
Open Data Access via web-services: Rodam package
ODAM facilitates the subsequent data mining
Make both metadata and data available for data mining.
[Diagram: experimentation/analysis produces data and metadata, which feed data mining methods (MFA, rCCA, pLDA, …)]
[Figure: subsets activome and qNMR_metabo (log10 transformed); treatments Water Stress and Control; all dev. stages, all treatments]
Open Data Access via web-services: application layer
[Diagram: data, TSV format and data linking (e.g. FRIM1) exposed through web services: minimal effort, maximal efficiency]
Use existing tools: spreadsheets, RStudio, BioStatFlow, Galaxy, Cytoscape, …
Develop lightweight tools if needed: R scripts (Galaxy), lightweight GUI (R Shiny)
FRIM - Fruit Integrative Modelling
http://www.bordeaux.inra.fr/pmb/dataexplorer/?ds=frim1
To remove an item from the selection: i) click on it, then ii) press the 'Suppr' (Delete) key
Explore several possibilities by interacting with the graph
To summarize
1. Preparation and cleaning of the data subset files
2. Classification of each column within its right category
3. Connections between the dataset files based on identifiers
4. Creation of the definition files, namely s_subsets.tsv and a_attributes.tsv
5. Deposit of the dataset files in the data repository
6. Checking online whether your data subset files are consistent
7. Testing the web-services online on your dataset
8. Use of the web-services through an application layer (R scripts, data explorer, ... )
Data subset files and their associated metadata files must be compliant with the TSV standard (Tab-Separated Values).
Note: TSV is an alternative to the common comma-separated values (CSV) format, which often causes difficulties because of the need to escape commas (see https://en.wikipedia.org/wiki/Tab-separated_values).
Advantages of this approach
Data sharing & data availability
- The "plants" table can be created even before the seeds are planted.
- Similarly, the "harvests" table can be created as soon as the harvests are done, before any analysis.
- Thus, these tables are generated only once in the project, and sharing can be set up as soon as the seeds are planted. Each analysis then complements the dataset as soon as it produces its own data subset.
- Data are accessible to everyone as soon as they are produced.
Identifiers centrally managed
- Data are archived and compiled, so there is no longer any need for a laborious investigation to find out who holds the right identifiers, etc.
[Diagram: sample identifiers carried through seeding, harvesting, samples preparation and samples analysis]
Advantages of this approach
Facilitate the subsequent publication of data
- Data are already readily available online through web services.
- Nothing prevents you from using these data to fill in existing databases, adding more elaborate annotations.
- Neither administrator privileges nor any programming skills are required.
[Diagram: data capture with minimal effort (PUT); TSV format and data linking; web services; data analysis/mining with maximum efficiency (GET)]
Advantages of this approach: minimal effort, maximum efficiency
Format the data
- Based on TSV: the choice was made to keep scientists' familiar practice of using worksheets, so that i) the same tool serves for both data files and metadata definition files, and ii) no programming skills are required
Give access through a web services layer
- Based on current standards (REST)
Use existing tools
- Spreadsheets, RStudio, BioStatFlow (biostatflow.org), Galaxy, Cytoscape, …
Develop lightweight tools if needed
- R scripts, lightweight GUI (R Shiny)
Have fun!
https://hub.docker.com/r/odam/getdata/
http://www.bordeaux.inra.fr/pmb/dataexplorer/
https://github.com/djacob65/ODAM
https://cran.r-project.org/package=Rodam
https://zenodo.org/record/154041
An online example