Odam: Open Data for Access and Mining

Give open access to your data and make it ready to be mined

Daniel Jacob
UMR 1332 BFP – Metabolism Group

Bordeaux Metabolomics Facility
May 2016

Open Data for Access and Mining

A data explorer as bonus

EDTMS

ODAM

Daniel Jacob – INRA UMR 1332 – May 2016

The experimental context: needs / wishes

Experimental workflow: seeding → harvesting → samples preparation → samples analysis

1. Identifiers centrally managed

2. Data sharing & data availability

3. Facilitate the subsequent data mining

4. Avoid the tedious implementation of a data management system involving a data model (RDBMS)

Make both metadata and data available for data mining

Implementation of an Experimental Data Tables Management System (EDTMS)

[Diagram: experimental data tables, with their sample identifiers, are dropped into a data repository mounted on myhost.org (data capture, minimal effort: PUT) and retrieved through http://myhost.org/ (GET).]

Simply dropping data files into a data repository (e.g. a local NAS or a remote storage space) should be enough to make them accessible through web services.

Data can be downloaded, explored and mined

No database schema, no programming code and no additional configuration on the server side.

Open Data for Access and Mining: the core idea in one shot

Preparation and cleaning of the data subset files

Data subset files: plants.tsv, harvests.tsv, samples.tsv, compounds.tsv, enzymes.tsv

• Whatever the kind of experiment, we assume a design of experiment (DoE) involving individuals, samples or other things as the main objects of study (e.g. plants, tissues, bacteria, …).

• We also assume that dependent variables are observed, resulting from the effects of some controlled experimental factors.

• Moreover, each object of study usually has an identifier, and the variables can be quantitative or qualitative.

• There can be a single type of object of study or several; in the latter case, there must be a relationship between the object types, which we assume to be of the “obtainedFrom” type.

Classification of each column within its right category

Data subset files: plants.tsv, harvests.tsv, samples.tsv, compounds.tsv, enzymes.tsv

Categories: identifier, link, factor, quantitative, qualitative

Data subset files and their associated metadata files must comply with the TSV format (tab-separated values).

• You have to organize your data subsets so that links can be established between them.
• In practice, this means adding a column containing the identifiers of the entity to which you want to connect the subset, implying an ‘obtainedFrom’ relation.
• Note that this duplication of identifiers must be the only redundant information across all data subsets.

https://en.wikipedia.org/wiki/Tab-separated_values


Connections between the dataset files based on identifiers

[Diagram: the data subset files plants.tsv, harvests.tsv, samples.tsv, compounds.tsv and enzymes.tsv correspond to the entities (concepts) Plants, Harvests, Samples, Compounds and Enzymes. Legend: identifier of the central entity of the subset; link between two subsets, carried out through identifiers (implies an ‘obtainedFrom’ relation); factor.]
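The linking convention above can be sketched in Python. This is only an illustration, not ODAM code: the identifier names PlanteID and Lot come from the slides, but the other columns (Treatment, Stage) and all values are invented, and the join is done with the standard csv module.

```python
import csv
import io

# Hypothetical content of two data subset files (tab-separated).
# Only the identifier names PlanteID and Lot come from the slides;
# the other columns and all values are made up for illustration.
PLANTS_TSV = "PlanteID\tTreatment\nP1\tControl\nP2\tWaterStress\n"
HARVESTS_TSV = "Lot\tPlanteID\tStage\nL1\tP1\tgreen\nL2\tP2\tgreen\nL3\tP1\tred\n"


def read_tsv(text):
    """Parse TSV text into a list of dicts, one per data row."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def join_on_link(children, parents, link):
    """Attach to each child row the columns of the parent row it was
    'obtainedFrom', using the shared identifier column `link`."""
    by_id = {row[link]: row for row in parents}
    return [{**by_id[child[link]], **child} for child in children]


plants = read_tsv(PLANTS_TSV)
harvests = read_tsv(HARVESTS_TSV)
merged = join_on_link(harvests, plants, link="PlanteID")
for row in merged:
    print(row["Lot"], row["PlanteID"], row["Treatment"], row["Stage"])
```

The PlanteID column in the harvests table is the only duplicated information; everything else about a plant is looked up through it.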


Creation of the metadata files

Supplementary files

To allow data to be explored and mined, we have to add some minimal but relevant metadata. For that, two metadata files are required:

• s_subsets.tsv: a file that associates each data subset with a key concept corresponding to the main entity of the subset, and that declares the "obtainedFrom" relations between these concepts

• a_attributes.tsv: a file that annotates each attribute (concept/variable) with some minimal but relevant metadata

Note: TSV is an alternative to the common comma-separated values (CSV) format, which often causes difficulties because of the need to escape commas.


s_subsets.tsv: this metadata file associates a key concept with each data subset file.

Each row describes one data subset with:
• the unique rank number of the data subset
• the key concept (i.e. the main entity) associated with the subset, in the form of a short name
• the identifier of the central entity of the subset
• the link between two subsets (implies an ‘obtainedFrom’ relation)
• the data file name

Rank    Key concept    Identifier    Data file name
1       Plants         PlanteID      plants.tsv
2       Harvests       Lot           harvests.tsv
3       Samples        SampleID      samples.tsv
4       Compounds      SampleID      compounds.tsv
5       Enzymes        SampleID      enzymes.tsv
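As a rough sketch of how such a file could be read, the Python snippet below parses a hypothetical s_subsets.tsv. The header names (rank, father_rank, subset, identifier, file) and the father_rank values are assumptions made for this example; the slides do not give the exact column layout, so the real file may differ.

```python
import csv
import io

# Hypothetical layout for s_subsets.tsv. The header names and the
# father_rank values (0 meaning "no parent") are assumptions for this
# sketch, not the documented ODAM format.
S_SUBSETS_TSV = (
    "rank\tfather_rank\tsubset\tidentifier\tfile\n"
    "1\t0\tplants\tPlanteID\tplants.tsv\n"
    "2\t1\tharvests\tLot\tharvests.tsv\n"
    "3\t2\tsamples\tSampleID\tsamples.tsv\n"
    "4\t3\tcompounds\tSampleID\tcompounds.tsv\n"
    "5\t3\tenzymes\tSampleID\tenzymes.tsv\n"
)


def load_subsets(text):
    """Return a dict mapping each subset short name to its metadata row."""
    subsets = {}
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        row["rank"] = int(row["rank"])
        row["father_rank"] = int(row["father_rank"])
        subsets[row["subset"]] = row
    return subsets


def obtained_from(subsets, name):
    """Return the name of the subset that `name` is obtained from, or None."""
    father = subsets[name]["father_rank"]
    for other in subsets.values():
        if other["rank"] == father:
            return other["subset"]
    return None


subsets = load_subsets(S_SUBSETS_TSV)
for name in subsets:
    print(name, "obtainedFrom", obtained_from(subsets, name))
```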


a_attributes.tsv: this metadata file annotates each attribute (variable) with some minimal but relevant metadata.

Categories shown in the example: identifier, factor
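To illustrate the idea of annotating each attribute with a category, here is a small Python sketch that groups attributes by category and flags unknown category names. The column names (subset, attribute, category) and the sample rows are hypothetical; only the category names are taken from the slides' legend.

```python
import csv
import io
from collections import defaultdict

# Category names as they appear in the slides' legend.
CATEGORIES = {"identifier", "link", "factor", "quantitative", "qualitative"}

# Hypothetical a_attributes.tsv content; the header names and the rows
# are invented for this sketch.
A_ATTRIBUTES_TSV = (
    "subset\tattribute\tcategory\n"
    "plants\tPlanteID\tidentifier\n"
    "plants\tTreatment\tfactor\n"
    "samples\tSampleID\tidentifier\n"
    "compounds\tGlucose\tquantitative\n"
    "compounds\tColor\tqualitativ\n"  # deliberate typo to show the check
)


def classify(text):
    """Group 'subset.attribute' names by category and list unknown categories."""
    by_category = defaultdict(list)
    unknown = []
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        name = f"{row['subset']}.{row['attribute']}"
        if row["category"] in CATEGORIES:
            by_category[row["category"]].append(name)
        else:
            unknown.append((name, row["category"]))
    return dict(by_category), unknown


by_category, unknown = classify(A_ATTRIBUTES_TSV)
print(by_category)
print(unknown)
```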

Updating the metadata files

Additional subsets and attributes (Plants, Harvests, Samples, Compounds, …) can be added step by step to s_subsets.tsv and a_attributes.tsv, as soon as data are produced.


Uploading your datasets to the data repository

No database schema, no programming code and no additional configuration on the server side.

Simply dropping data files into the data repository (e.g. a NAS) should be enough to make them accessible through web services.

[Screenshot: your data subset files are copied into your dataset entry (named ‘frim1’ as an example) within the data repository, mounted here as Z: (\\Storage).]

[Diagram: data capture with minimal effort (PUT) into the data repository mounted on myhost.org; data retrieval through GET.]


Checking online whether your data subset files are consistent

http://myhost.org/check/frim1

[Diagram: myhost.org serves the data repository \\Storage\DataRepos hosted on the NAS.]

Many checks can be done automatically for you.
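The slides do not list which checks the online service runs. As a purely illustrative local equivalent, the Python sketch below tests two plausible conditions: identifiers are unique and non-empty within a subset, and every link value exists in the parent subset. The function name check_subset and the sample rows are hypothetical.

```python
def check_subset(rows, identifier, link=None, parent_ids=()):
    """Return a list of human-readable problems found in one data subset.

    rows       -- list of dicts, one per data row
    identifier -- name of the subset's own identifier column
    link       -- name of the column pointing to the parent subset, if any
    parent_ids -- identifiers present in the parent subset
    """
    problems = []
    seen = set()
    for line_no, row in enumerate(rows, start=2):  # line 1 is the header
        ident = row.get(identifier, "")
        if not ident:
            problems.append(f"line {line_no}: empty {identifier}")
        elif ident in seen:
            problems.append(f"line {line_no}: duplicate {identifier} {ident!r}")
        seen.add(ident)
        if link is not None and row.get(link) not in parent_ids:
            problems.append(f"line {line_no}: unknown {link} {row.get(link)!r}")
    return problems


# Made-up rows: one duplicated Lot and one PlanteID missing from the parent.
plant_ids = {"P1", "P2"}
harvest_rows = [
    {"Lot": "L1", "PlanteID": "P1"},
    {"Lot": "L1", "PlanteID": "P2"},
    {"Lot": "L3", "PlanteID": "P9"},
]
problems = check_subset(harvest_rows, "Lot", link="PlanteID", parent_ids=plant_ids)
for problem in problems:
    print(problem)
```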


Open Data, Access and Mining: web-services

[Diagram: the experimental workflow (seeding → harvesting → samples preparation → samples analysis) feeds the data storage. Data capture, minimal effort (PUT): preparation and cleaning of the data subset files, TSV data format, data linking, check. Data access, maximal efficiency (GET): web services. Example dataset: FRIM1 (*).]

After depositing your complete dataset as described previously:
• Open access is given to your data through web services
• They are ready to be mined
• No specific code or additional configuration is needed

(*) https://www.erasysbio.net/index.php?index=266



REST Services: hierarchical tree of resource naming (URL)

GET http://myhost.org/getdata/<data format>/<dataset name>/< … >/< … >

• <data format>: xml, tsv or json
• <dataset name>: e.g. frim1, for the FRIM1 (*) dataset
• <subset> or (<subset>)
• then, to retrieve data or metadata: <entry> with one or more <value>, or <category>

(*) https://doi.org/10.5281/zenodo.154041


Open Data Access via web-services: Examples based on FRIM1

Metadata:
http://myhost.org/getdata/xml/frim1
http://myhost.org/getdata/xml/frim1/(compounds)/quantitative

Data:
http://myhost.org/getdata/xml/frim1/plants
http://myhost.org/getdata/xml/frim1/harvests/lot/1
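The URL pattern shown in these examples can be assembled programmatically. The Python helper below is a minimal sketch: build_url and its parameters are hypothetical names, myhost.org is the placeholder host used in the slides, and only the three formats named in the slides are accepted.

```python
from urllib.parse import quote

BASE_URL = "http://myhost.org/getdata"  # placeholder host used in the slides
FORMATS = ("xml", "tsv", "json")


def build_url(data_format, dataset, subset=None, *, merged=False, path=()):
    """Assemble a getdata URL.

    merged -- wrap the subset name in parentheses, as in (samples)
    path   -- trailing segments, e.g. ("lot", 1) or ("quantitative",)
    """
    if data_format not in FORMATS:
        raise ValueError(f"unsupported data format: {data_format!r}")
    parts = [BASE_URL, data_format, dataset]
    if subset is not None:
        parts.append(f"({subset})" if merged else subset)
        parts.extend(quote(str(segment), safe="") for segment in path)
    return "/".join(parts)


print(build_url("xml", "frim1"))
print(build_url("xml", "frim1", "plants"))
print(build_url("xml", "frim1", "harvests", path=("lot", 1)))
print(build_url("xml", "frim1", "compounds", merged=True, path=("quantitative",)))
```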


http://myhost.org/getdata/xml/frim1/(samples)/treatment/Control

A subset name in parentheses designates a merged set of data subsets: all the subsets of lower rank than the specified subset are merged with it, following the pathway defined by the “obtainedFrom” links.

(samples) = plants + harvests + samples
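The merging rule can be illustrated by walking the “obtainedFrom” links upwards from the requested subset. The Python sketch below is an assumption-laden illustration, not the server's implementation: the chain plants, harvests, samples is stated in the slides, while attaching compounds and enzymes to samples is inferred from their SampleID column.

```python
# Assumed 'obtainedFrom' links between the FRIM1 subsets. The chain
# plants <- harvests <- samples is given in the slides; the parents of
# compounds and enzymes are inferred from their SampleID link column.
OBTAINED_FROM = {
    "plants": None,
    "harvests": "plants",
    "samples": "harvests",
    "compounds": "samples",
    "enzymes": "samples",
}


def merge_path(subset, links=OBTAINED_FROM):
    """List the subsets to merge for (subset), from the root down to it."""
    path = []
    current = subset
    while current is not None:
        if current in path:
            raise ValueError(f"cycle in obtainedFrom links at {current!r}")
        path.append(current)
        current = links[current]
    return list(reversed(path))


print(" + ".join(merge_path("samples")))
print(" + ".join(merge_path("compounds")))
```

Note that enzymes is not part of the path for (compounds): only the ancestors along the links are merged, not sibling subsets.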


Open Data Access via web-services: Application layer

Minimal effort, maximal efficiency

Use existing tools: spreadsheets, RStudio, BioStatFlow, Galaxy, Cytoscape, …


Retrieving Data within R

The R package Rodam

Open Data Access via web-services: Rodam package

Example with <data format> = tsv, <dataset name> = frim1, <subset> = (samples), <entry> = sample and <value> = 365:

GET http://www.bordeaux.inra.fr/pmb/getdata/tsv/frim1/(samples)/sample/365
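The slides use the R package Rodam for this step. Outside R, the same GET request can be issued with any HTTP client; the Python sketch below uses only the standard library. It assumes the service returns plain TSV with a header row, and the sample response body is invented so that the parsing can be shown without network access (the server at this URL may no longer be reachable).

```python
import csv
import io
import urllib.request

URL = "http://www.bordeaux.inra.fr/pmb/getdata/tsv/frim1/(samples)/sample/365"


def parse_tsv(text):
    """Parse a TSV response body into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def fetch_tsv(url, timeout=30):
    """GET a TSV resource and return its rows (requires network access)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return parse_tsv(response.read().decode(charset))


# Offline demonstration on a made-up response body; the column names
# and values are illustrative, not actual FRIM1 content.
sample_body = "SampleID\tTreatment\n365\tControl\n"
rows = parse_tsv(sample_body)
print(rows[0]["SampleID"], rows[0]["Treatment"])
```

With network access, `fetch_tsv(URL)` would return the rows for sample 365, assuming the service is still online.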


Open Data Access via web-services: Rodam package

Read metadata, i.e. category types within the data: get the data subset ‘activome’ along with its metadata.

Example with <data format> = tsv, <dataset name> = frim1, <subset> = (activome) and <category> = factor:

GET http://www.bordeaux.inra.fr/pmb/getdata/tsv/frim1/(activome)/factor


ODAM facilitates the subsequent data mining

Make both metadata and data available for data mining.

[Diagram: experimentation/analysis produces data and metadata, which feed data mining methods such as MFA, rCCA, pLDA, …]

[Figures: Rodam package examples on the subsets activome and qNMR_metabo (log10 transformed): Water Stress vs Control; all development stages, all treatments.]


Minimal effort, maximal efficiency

Use existing tools: spreadsheets, RStudio, BioStatFlow, Galaxy, Cytoscape, …

Develop, if needed, lightweight tools: R scripts (Galaxy), lightweight GUI (R Shiny)


FRIM - Fruit Integrative Modelling

Data explorer: http://www.bordeaux.inra.fr/pmb/dataexplorer/?ds=frim1

To remove an item from the selection: i) click on it, and then ii) press the ‘Suppr’ (Delete) key.

Explore several possibilities by interacting with the graph.


To summarize

1. Preparation and cleaning of the data subset files

2. Classification of each column within its right category

3. Connections between the dataset files based on identifiers

4. Creation of the definition files, namely s_subsets.tsv and a_attributes.tsv

5. Deposit of the dataset files in the data repository

6. Checking online whether your data subset files are consistent

7. Testing the web-services online on your dataset

8. Use of the web-services through an application layer (R scripts, data explorer, ...)




Advantages of this approach

Data sharing & data availability
- The "plants" table may be created even before planting the seeds.
- Similarly, the "harvests" table can be created as soon as the harvests are done, before any analysis.
- Thus, these tables are generated only once in the project, and sharing can be set up as soon as the seeds are planted. Each analysis then complements the data set as soon as it produces its own sub-dataset.
- Data are accessible to everyone as soon as they are produced.

Identifiers centrally managed
- Data are archived and compiled, so there is no need for a laborious investigation to find out who holds the right identifiers, etc.



Facilitate the subsequent publication of data
- Data are already readily available online through web-services.
- Nothing prevents using these data to fill in existing databases, by adding more elaborate annotations.
- Neither administrator privileges nor any programming skills are required.


Minimal effort, maximum efficiency

Format the data
- Based on TSV: the choice was made to keep the scientists' familiar habit of using worksheets, so that i) the same tool serves for both data files and metadata definition files, and ii) no programming skills are required.

Give access through a web services layer
- Based on current standards (REST)

Use existing tools
- Spreadsheets, RStudio, BioStatFlow, Galaxy, Cytoscape, …

Develop, if needed, lightweight tools
- R scripts, lightweight GUI (R Shiny)

biostatflow.org


Have fun!


https://hub.docker.com/r/odam/getdata/

http://www.bordeaux.inra.fr/pmb/dataexplorer/

https://github.com/djacob65/ODAM

https://cran.r-project.org/package=Rodam

https://zenodo.org/record/154041

An online example

http://dx.doi.org/10.5281/zenodo.17819

Date post:	20-Mar-2017
Category:	Data & Analytics
Upload:	daniel-jacob