
Odam: Open Data for Access and Mining

Date post: 20-Mar-2017
Upload: daniel-jacob
Transcript
Page 1: Odam: Open Data for Access and Mining

Give open access to your data and make them ready to be mined

Daniel Jacob
UMR 1332 BFP – Metabolism Group

Bordeaux Metabolomics Facility
May 2016

Open Data for Access and Mining

A data explorer as bonus

EDTMS

ODAM

Page 2: Odam: Open Data for Access and Mining

Daniel Jacob – INRA UMR 1332 – May 2016

The experimental context: needs / wishes

Workflow: seeding, harvesting, samples preparation, samples analysis (sample identifiers are used throughout)

1. Identifiers centrally managed
2. Data sharing & data availability
3. Facilitate the subsequent data mining
4. Avoid the tedious implementation of a data management system involving a data model (RDBMS)

Make both metadata and data available for data mining

Page 3: Odam: Open Data for Access and Mining

Open Data for Access and Mining: the core idea in one shot

Implementation of an Experimental Data Tables Management System (EDTMS)

Diagram: experimental data tables are dropped into a data repository mounted on myhost.org (data capture, minimal effort: PUT) and are then served from http://myhost.org/ (GET).

• Merely dropping data files in a data repository (e.g. a local NAS or remote storage space) should allow users to access them through web services.
• Data can be downloaded, explored and mined.
• No database schema, no programming code and no additional configuration on the server side.

Page 4: Odam: Open Data for Access and Mining

Preparation and cleaning of the data subset files

Data subset files: plants.tsv, harvests.tsv, samples.tsv, compounds.tsv, enzymes.tsv

• Whatever the kind of experiment, this assumes a design of experiment (DoE) involving individuals, samples or other things as the main objects of study (e.g. plants, tissues, bacteria, …).
• This also assumes the observation of dependent variables resulting from the effects of some controlled experimental factors.
• Moreover, each object of study usually has an identifier, and the variables can be quantitative or qualitative.
• There can be either one type of object of study or several, but in the latter case there must be a relationship between object types, which we assume to be of the "obtainedFrom" type.

Page 5: Odam: Open Data for Access and Mining

Classification of each column within its right category

Data subset files: plants.tsv, harvests.tsv, samples.tsv, compounds.tsv, enzymes.tsv

Categories (column legend): factor, quantitative, qualitative, identifier, link

Data subset files and their associated metadata files must be compliant with the TSV standard (Tab-Separated Values).

• You have to organize your data subsets so that links can be established between them.
• In practice, this means adding a column containing the identifiers of the entity to which you want to connect the subset, implying an 'obtainedFrom' relation.
• Note that this duplication of identifiers must be the only redundant information across all data subsets.
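
The linking rule above can be illustrated with a minimal Python sketch. It assumes two tiny in-memory TSV subsets; the `Treatment` and `HarvestDate` columns and all values are hypothetical, and the helper names are made up for this example. It simply checks that the only column shared by the two subsets is the link column:

```python
import csv
import io

# Hypothetical content of plants.tsv: one row per plant, identified by PlanteID.
PLANTS_TSV = "PlanteID\tTreatment\nP1\tControl\nP2\tWaterStress\n"

# Hypothetical content of harvests.tsv: one row per harvest lot.
# PlanteID is repeated here as the link column ('obtainedFrom' plants).
HARVESTS_TSV = "Lot\tPlanteID\tHarvestDate\n1\tP1\t2016-05-02\n2\tP2\t2016-05-02\n"


def header(tsv_text):
    """Return the column names of a TSV document."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    return next(reader)


def shared_columns(tsv_a, tsv_b):
    """Columns present in both subsets; ideally only the link column."""
    return sorted(set(header(tsv_a)) & set(header(tsv_b)))


print(shared_columns(PLANTS_TSV, HARVESTS_TSV))
```

If the result contains anything besides the link column, some information is probably duplicated between the two files.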

Page 6: Odam: Open Data for Access and Mining

Connections between the dataset files based on identifiers

Data subset files and their entities (concepts): plants.tsv (Plants), harvests.tsv (Harvests), samples.tsv (Samples), compounds.tsv (Compounds), enzymes.tsv (Enzymes)

Legend:
• Identifier of the central entity of the subset
• Link between 2 subsets, established through identifiers (implies an 'obtainedFrom' relation)
• Categories: factor, quantitative, qualitative, identifier, link

Page 7: Odam: Open Data for Access and Mining

Creation of the metadata files

Supplementary files

To allow data to be explored and mined, we have to add some minimal but relevant metadata. For that, 2 metadata files are required:

• s_subsets.tsv: a file that associates each data subset with a key concept corresponding to the main entity of the subset, and that records the "obtainedFrom" relations between these concepts
• a_attributes.tsv: a file that annotates each attribute (concept/variable) with some minimal but relevant metadata

Data subset files and their associated metadata files must be compliant with the TSV standard (Tab-Separated Values).

Note: TSV is an alternative to the common comma-separated values (CSV) format, which often causes difficulties because of the need to escape commas.
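
To make the CSV remark concrete, here is a small sketch using Python's standard `csv` module. The record and its values are hypothetical; the point is only that a field containing a comma must be quoted in CSV, whereas the same field needs no special treatment in TSV:

```python
import csv
import io

# One record whose second field contains a comma (hypothetical values).
row = ["S365", "pericarp, outer layer", "12.5"]


def serialize(fields, delimiter):
    """Write one record with Python's csv module and return the text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    writer.writerow(fields)
    return buffer.getvalue()


as_csv = serialize(row, ",")   # the comma forces quoting of the field
as_tsv = serialize(row, "\t")  # no quoting needed: the field has no tab

print(repr(as_csv))
print(repr(as_tsv))
```

A field that itself contains a tab would raise the same issue in TSV, but tabs are far less common than commas in free-text annotations.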

Page 8: Odam: Open Data for Access and Mining

Creation of the metadata files

s_subsets.tsv: this metadata file associates a key concept with each data subset file.

Each entry of s_subsets.tsv provides:
• Unique rank number of the data subset
• Key concept (i.e. the main entity) associated with the subset, in the form of a short name
• Identifier of the central entity of the subset
• Link between 2 subsets (implies an 'obtainedFrom' relation)
• Data file name

Example entry: rank 1, key concept Plants, identifier PlanteID, data file plants.tsv

Diagram: the five subsets are ranked 1 to 5, with the following identifier columns: plants.tsv (PlanteID), harvests.tsv (Lot), samples.tsv (SampleID), compounds.tsv (SampleID), enzymes.tsv (SampleID).

Categories: factor, quantitative, qualitative, identifier
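
As a rough illustration of how such a file could be used, the sketch below parses a hypothetical s_subsets.tsv and follows the 'obtainedFrom' links from one subset up to the root. The column names (`rank`, `father_rank`, `subset`, `identifier`, `file`), the parent ranks and the ranks given to compounds and enzymes are assumptions made for this example, not the documented file layout:

```python
import csv
import io

# Hypothetical s_subsets.tsv: the column names and the father_rank values
# are assumptions made for this sketch, not the documented file layout.
S_SUBSETS_TSV = (
    "rank\tfather_rank\tsubset\tidentifier\tfile\n"
    "1\t0\tplants\tPlanteID\tplants.tsv\n"
    "2\t1\tharvests\tLot\tharvests.tsv\n"
    "3\t2\tsamples\tSampleID\tsamples.tsv\n"
    "4\t3\tcompounds\tSampleID\tcompounds.tsv\n"
    "5\t3\tenzymes\tSampleID\tenzymes.tsv\n"
)


def load_subsets(tsv_text):
    """Index the subset definitions by their rank number."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {int(row["rank"]): row for row in reader}


def obtained_from_chain(subsets, name):
    """Follow the 'obtainedFrom' links from a subset up to the root."""
    by_name = {row["subset"]: rank for rank, row in subsets.items()}
    chain = []
    rank = by_name[name]
    while rank in subsets:  # father_rank 0 means "no parent" in this sketch
        chain.append(subsets[rank]["subset"])
        rank = int(subsets[rank]["father_rank"])
    return list(reversed(chain))


subsets = load_subsets(S_SUBSETS_TSV)
print(obtained_from_chain(subsets, "enzymes"))
```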

Page 9: Odam: Open Data for Access and Mining

Creation of the metadata files

a_attributes.tsv: this metadata file allows each attribute (variable) to be annotated with some minimal but relevant metadata.

Categories: factor, quantitative, qualitative, identifier

Subsets shown: Plants, Harvests, Samples, Compounds, …
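
One way such annotations could be exploited is to select the columns of a subset by category. The following sketch assumes a hypothetical a_attributes.tsv with `subset`, `attribute` and `category` columns; these column names and the attribute names are invented for illustration, only the four categories come from the slides:

```python
import csv
import io

# Hypothetical a_attributes.tsv: column names and attribute names are
# assumptions for this sketch; the four categories come from the slides.
A_ATTRIBUTES_TSV = (
    "subset\tattribute\tcategory\n"
    "samples\tSampleID\tidentifier\n"
    "samples\tTreatment\tfactor\n"
    "compounds\tSampleID\tidentifier\n"
    "compounds\tglucose\tquantitative\n"
    "compounds\tfructose\tquantitative\n"
)


def attributes_of(tsv_text, subset, category):
    """List the attributes of one subset that belong to one category."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [
        row["attribute"]
        for row in reader
        if row["subset"] == subset and row["category"] == category
    ]


print(attributes_of(A_ATTRIBUTES_TSV, "compounds", "quantitative"))
```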

Page 10: Odam: Open Data for Access and Mining

Updating the metadata files

Files concerned: s_subsets.tsv, a_attributes.tsv

Additional subsets and attributes can be added step by step, as soon as data are produced.

Page 11: Odam: Open Data for Access and Mining

Uploading your datasets in the data repository

Your data subset files go into your dataset entry (named 'frim1' as an example) within the data repository, shown here as the network drive Z: (\\Storage).

Diagram: the data repository is mounted on myhost.org; data capture takes minimal effort (PUT) and the data are then served through GET.

• Merely dropping data files in the data repository (e.g. a NAS) should allow users to access them through web services.
• No database schema, no programming code and no additional configuration on the server side.
• Data subset files and their associated metadata files must be compliant with the TSV standard (Tab-Separated Values).

Page 12: Odam: Open Data for Access and Mining

Checking online whether your data subset files are consistent

http://myhost.org/check/frim1

Diagram: myhost.org reads the dataset from the data repository \\Storage\DataRepos (NAS).

Many checks can be done automatically for you.
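
The slides do not list which checks the online service performs. As an illustration of the kind of consistency check that could also be run locally before uploading, the sketch below looks for link values with no matching parent and for duplicated identifiers. All data and function names are hypothetical:

```python
import csv
import io

# Hypothetical subsets; harvest lot 3 points to a plant that does not exist.
PLANTS_TSV = "PlanteID\tTreatment\nP1\tControl\nP2\tWaterStress\n"
HARVESTS_TSV = "Lot\tPlanteID\n1\tP1\n2\tP2\n3\tP9\n"


def read_rows(tsv_text):
    """Parse TSV text into a list of dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))


def dangling_links(child_rows, parent_rows, link_column):
    """Return link values in the child subset that are unknown to the parent."""
    known = {row[link_column] for row in parent_rows}
    return sorted({row[link_column] for row in child_rows} - known)


def duplicated_ids(rows, id_column):
    """Return identifier values that appear more than once in a subset."""
    seen, duplicates = set(), set()
    for row in rows:
        value = row[id_column]
        (duplicates if value in seen else seen).add(value)
    return sorted(duplicates)


plants = read_rows(PLANTS_TSV)
harvests = read_rows(HARVESTS_TSV)
print(dangling_links(harvests, plants, "PlanteID"))
print(duplicated_ids(plants, "PlanteID"))
```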

Page 13: Odam: Open Data for Access and Mining

Open Data, Access and Mining: web-services

Diagram: along the experimental workflow (seeding, harvesting, samples preparation, samples analysis), the data go through preparation and cleaning of the data subset files, TSV formatting, data linking and checking before being dropped into the data storage (PUT); the web services then serve them (GET). Example dataset: FRIM1 (*).

Minimal effort (PUT), maximal efficiency (GET)

After depositing your complete dataset as described previously:
• Open access is given to your data through web services
• They are ready to be mined
• No specific code or additional configuration is needed

(*) https://www.erasysbio.net/index.php?index=266

Page 14: Odam: Open Data for Access and Mining

Open Data, Access and Mining: web-services

REST services: hierarchical tree of resource naming (URL)

GET http://myhost.org/getdata/<data format>/<dataset name>/< … >/< … >

• <data format>: xml, tsv or json
• <dataset name>: e.g. frim1, for the FRIM1 dataset (*)
• <subset> or (<subset>)
• Retrieving data: <entry>, then <value>
• Retrieving metadata: <category>

Categories (legend): factor, quantitative, qualitative, identifier, link

(*) https://doi.org/10.5281/zenodo.154041
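
The naming scheme above lends itself to a small URL helper. The sketch below is only an illustration in Python: `getdata_url` and its parameters are hypothetical names, and `myhost.org` is the placeholder host used in the slides.

```python
from urllib.parse import quote

BASE_URL = "http://myhost.org/getdata"  # placeholder host used in the slides


def getdata_url(data_format, dataset, subset=None, *path, merged=False):
    """Build a getdata URL following the hierarchical naming scheme.

    merged=True wraps the subset in parentheses, i.e. (<subset>).
    *path is either (<category>,) or (<entry>, <value>).
    """
    if data_format not in {"xml", "tsv", "json"}:
        raise ValueError(f"unsupported data format: {data_format}")
    parts = [data_format, dataset]
    if subset is not None:
        parts.append(f"({subset})" if merged else subset)
        parts.extend(str(p) for p in path)
    # Keep parentheses readable; escape anything else that is unsafe.
    return BASE_URL + "/" + "/".join(quote(p, safe="()") for p in parts)


print(getdata_url("xml", "frim1"))
print(getdata_url("xml", "frim1", "harvests", "lot", 1))
print(getdata_url("xml", "frim1", "compounds", "quantitative", merged=True))
```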

Page 15: Odam: Open Data for Access and Mining

Open Data Access via web-services: Examples based on FRIM1

• http://myhost.org/getdata/xml/frim1
• http://myhost.org/getdata/xml/frim1/plants
• http://myhost.org/getdata/xml/frim1/harvests/lot/1
• http://myhost.org/getdata/xml/frim1/(compounds)/quantitative

Depending on the request, the response contains either metadata or data.

Page 16: Odam: Open Data for Access and Mining

Open Data Access via web-services: Examples based on FRIM1

http://myhost.org/getdata/xml/frim1/(samples)/treatment/Control

(<subset>): the set of data obtained by merging the specified subset with all the subsets of lower rank, following the pathway defined by the "obtainedFrom" links.

Example: (samples) = plants + harvests + samples
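
How the server performs this merge is not described in the slides. The following Python sketch only illustrates the idea on toy data: each child row is extended with the columns of the parent it was obtained from, and the merged rows are then filtered on `treatment`. All values are made up, and `merge` is a hypothetical helper:

```python
import csv
import io

# Hypothetical subsets, ordered by rank; every value is made up.
PLANTS = "PlanteID\ttreatment\nP1\tControl\nP2\tWaterStress\n"
HARVESTS = "Lot\tPlanteID\n1\tP1\n2\tP2\n"
SAMPLES = "SampleID\tLot\n365\t1\n366\t2\n"


def read_rows(tsv_text):
    """Parse TSV text into a list of dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))


def merge(parent_rows, child_rows, link_column):
    """Attach to each child row the columns of the parent it was obtained from."""
    parents = {row[link_column]: row for row in parent_rows}
    return [{**parents[row[link_column]], **row} for row in child_rows]


# (samples) = plants + harvests + samples, following the links upward.
harvests = merge(read_rows(PLANTS), read_rows(HARVESTS), "PlanteID")
samples = merge(harvests, read_rows(SAMPLES), "Lot")

# Equivalent of .../(samples)/treatment/Control on this toy data.
control = [row for row in samples if row["treatment"] == "Control"]
print(control)
```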

Page 17: Odam: Open Data for Access and Mining

Open Data Access via web-services: Application layer

Diagram: the data (FRIM1), the TSV format and data linking feed the web services, on top of which sits the application layer. Minimal effort, maximal efficiency.

Use existing tools:
- Spreadsheets, RStudio, BioStatFlow, Galaxy, Cytoscape, …

Page 18: Odam: Open Data for Access and Mining

Open Data Access via web-services: Application layer

Retrieving data within R: the R package Rodam

Page 19: Odam: Open Data for Access and Mining

Open Data Access via web-services: Rodam package

GET http://www.bordeaux.inra.fr/pmb/getdata/tsv/frim1/(samples)/sample/365

• <data format>: tsv
• <dataset name>: frim1
• (<subset>): samples
• <entry>: sample
• <value>: 365

Page 20: Odam: Open Data for Access and Mining

Open Data Access via web-services: Rodam package

Get the data subset 'activome' along with its metadata

Read metadata, i.e. the category types within the data:

GET http://www.bordeaux.inra.fr/pmb/getdata/tsv/frim1/(activome)/factor

• <data format>: tsv
• <dataset name>: frim1
• (<subset>): activome
• <category>: factor
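
Outside R, the same request can be issued with any HTTP client. The sketch below is a Python analogue, not the Rodam API: `parse_tsv` and `fetch_tsv` are hypothetical helpers, the UTF-8 encoding is an assumption, and the sample response body is invented so that the example runs offline.

```python
import csv
import io
from urllib.request import urlopen

# URL shown on the slide; whether it still answers is not guaranteed.
URL = "http://www.bordeaux.inra.fr/pmb/getdata/tsv/frim1/(activome)/factor"


def parse_tsv(text):
    """Turn a TSV response body into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def fetch_tsv(url, timeout=10):
    """Download a TSV resource and parse it (performs a network call)."""
    with urlopen(url, timeout=timeout) as response:
        # UTF-8 is assumed here; the slides do not state the encoding.
        return parse_tsv(response.read().decode("utf-8"))


# Offline demonstration on a made-up response body.
sample_body = "attribute\tdescription\nTreatment\tWatering regime\n"
rows = parse_tsv(sample_body)
print(rows[0]["attribute"])
```

Calling `fetch_tsv(URL)` would perform the actual request; it is deliberately not invoked here.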

Page 21: Odam: Open Data for Access and Mining


Open Data Access via web-services Rodam package

Page 22: Odam: Open Data for Access and Mining

ODAM facilitates the subsequent data mining

Open Data Access via web-services: Rodam package

Make both metadata and data available for data mining.

Diagram: experimentation and analysis produce data and metadata, which feed data mining (MFA, rCCA, pLDA, …).

Figure (plot labels): activome, qNMR_metabo; Water Stress, Control; All Dev. Stages, All Treatments; (log10 transformed)

Page 23: Odam: Open Data for Access and Mining

Open Data Access via web-services: Application layer

Diagram: the data (FRIM1), the TSV format and data linking feed the web services, on top of which sits the application layer. Minimal effort, maximal efficiency.

Use existing tools:
- Spreadsheets, RStudio, BioStatFlow, Galaxy, Cytoscape, …

Develop, if needed, lightweight tools:
- R scripts (Galaxy), lightweight GUI (R Shiny)

Page 24: Odam: Open Data for Access and Mining

FRIM - Fruit Integrative Modelling

http://www.bordeaux.inra.fr/pmb/dataexplorer/?ds=frim1

Page 25: Odam: Open Data for Access and Mining

FRIM - Fruit Integrative Modelling

http://www.bordeaux.inra.fr/pmb/dataexplorer/?ds=frim1

Page 26: Odam: Open Data for Access and Mining

FRIM - Fruit Integrative Modelling

Page 27: Odam: Open Data for Access and Mining

FRIM - Fruit Integrative Modelling

To remove an item from the selection: i) click on it, then ii) press the 'Delete' ('Suppr') key.

Page 28: Odam: Open Data for Access and Mining

FRIM - Fruit Integrative Modelling

Page 29: Odam: Open Data for Access and Mining

FRIM - Fruit Integrative Modelling

Explore several possibilities by interacting with the graph.

Page 30: Odam: Open Data for Access and Mining

To summarize

1. Preparation and cleaning of the data subset files
2. Classification of each column within its right category
3. Connections between the dataset files based on identifiers
4. Creation of the definition files, namely s_subsets.tsv and a_attributes.tsv
5. Deposit of the dataset files in the data repository
6. Checking online whether your data subset files are consistent
7. Testing the web services online on your dataset
8. Use of the web services through an application layer (R scripts, data explorer, ...)

Data subset files and their associated metadata files must be compliant with the TSV standard (Tab-Separated Values).

Note: TSV is an alternative to the common comma-separated values (CSV) format, which often causes difficulties because of the need to escape commas (see https://en.wikipedia.org/wiki/Tab-separated_values).

Page 31: Odam: Open Data for Access and Mining

Advantages of this approach

Workflow: seeding, harvesting, samples preparation, samples analysis (sample identifiers are used throughout)

Data sharing & data availability
- The "plants" table may be created even before planting the seeds.
- Similarly, the "harvests" table can be created as soon as the harvests are done, before any analysis.
- Thus, these tables are generated only once in the project, and sharing can be set up as soon as the seeds are planted. Each analysis then complements the dataset as soon as it produces its own data subset.
- Data are accessible to everyone as soon as they are produced.

Identifiers centrally managed
- Data are archived and compiled, so there is no longer any need for a laborious investigation to find out who holds the right identifiers, etc.

Page 32: Odam: Open Data for Access and Mining

Advantages of this approach

Facilitate the subsequent publication of data
- Data are already readily available online through web services.
- Nothing prevents you from using these data to fill in existing databases, adding more elaborate annotations.
- Neither administrator privileges nor any programming skills are required.

Diagram: data, TSV format and data linking on the PUT side (data capture: minimal effort); web services on the GET side (data analysis/mining: maximum efficiency).

Page 33: Odam: Open Data for Access and Mining

Advantages of this approach: minimal effort, maximum efficiency

Format the data
- Based on TSV: a choice to keep the scientists' good old habit of using worksheets, thus i) using the same tool for both data files and metadata definition files, and ii) requiring no programming skills

Give access through a web services layer
- Based on current standards (REST)

Use existing tools
- Spreadsheets, RStudio, BioStatFlow (biostatflow.org), Galaxy, Cytoscape, …

Develop, if needed, lightweight tools
- R scripts, lightweight GUI (R Shiny)

Page 34: Odam: Open Data for Access and Mining

Have fun!

Daniel Jacob
UMR 1332 BFP – Metabolism Group

Bordeaux Metabolomics Facility
May 2016

Open Data for Access and Mining

https://hub.docker.com/r/odam/getdata/

http://www.bordeaux.inra.fr/pmb/dataexplorer/

https://github.com/djacob65/ODAM

https://cran.r-project.org/package=Rodam

https://zenodo.org/record/154041

An online example

