Date post: 25-May-2020

Supplementary Figure 1

The distribution of metadata similarity score for the different omics types

The distribution of metadata similarity score for the different omics types: genomics, proteomics and metabolomics. The vertical lines represent the median of each distribution at values: 6.5, 3.1, and 3.8, respectively

Nature Biotechnology: doi/10.1038/nbt.3790


Supplementary Figure 2

Distribution of shared molecules

The distribution of shared molecules (biological similarity score) for proteomics and metabolomics. The vertical lines represent the median of each distribution at values: 0.23 and 0.21, respectively


Supplementary Figure 3

MTBLS169 dataset

Dataset view including the list of the datasets similar by metadata (http://www.omicsdi.org/dataset/metabolights_dataset/MTBLS169).


Supplementary Figure 4

Chord diagram of the projects that share molecules with PRIDE dataset PRD000269 (entitled Aortic extracellular space components)

The chord diagram shows the accession numbers of the datasets that share molecules with the dataset, including the biological score.


Supplementary Figure 5

Home webpage of OmicsDI including the different browsing boxes

(a) The “wordcloud box” gives an overview of the most relevant terms in the different datasets; (b) the “biological box” provides quick access to relevant biological metadata of the datasets, such as tissues, species, and diseases; (c) the “repository box” provides an overview of the number of datasets per repository and omics type; (d) the “latest datasets box” provides a short summary of the ten datasets most recently added to OmicsDI; (e) the “most accessed datasets box” aims to provide a metric of data access and relevance of the datasets; (f) the “datasets per year box” presents the number of datasets per year, per omics type.


Supplementary Figure 6

Search page and dataset results for a query using the UniProt protein identifier Q9HAU5


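| Relation | Repository | Number of datasets |
|---|---|---|
| Reanalyzed by other repositories | ArrayExpress | 2913 |
| Reanalyzed by other repositories | PRIDE | 293 |
| Reanalyzed by other repositories | MassIVE | 15 |
| Reanalysis of other datasets | Expression Atlas | 2913 |
| Reanalysis of other datasets | Peptide Atlas | 353 |
| Reanalysis of other datasets | GPMDB | 282 |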

Supplementary Table 1

Number of datasets per repository including the relation ‘Reanalyzed by’ and ‘Reanalysis of’.


| Repository | Other omics |
|---|---|
| ArrayExpress | 503 |
| PRIDE | 60 |
| EGA | 43 |
| MetabolomeExpress | 12 |
| MetaboLights | 10 |

Supplementary Table 2

Datasets with other related omics datasets in a different resource.

This number is generated by cross-referencing datasets annotated as resulting from the same publication.


Omics Discovery Index – Discovering and Linking Public ‘Omics’ Datasets

Yasset Perez-Riverol a,†,*, Mingze Bai a,b,†, Felipe da Veiga Leprevost c, Silvano Squizzato a, Young Mi Park a, Kenneth Haug a, Adam J. Carroll d, Dylan Spalding a, Justin Paschall a, Mingxun Wang e, Noemi del-Toro a, Tobias Ternent a, Peng Zhang d,f, Nicola Buso a, Nuno Bandeira e, Eric W. Deutsch g, David S Campbell g, Ronald C. Beavis h, Reza M. Salek a, Ugis Sarkans a, Robert Petryszak a, Maria Keays a, Eoin Fahy i, Manish Sud i, Shankar Subramaniam i, Ariana Barbera j, Rafael C. Jiménez k, Alexey I. Nesvizhskii c, Susanna-Assunta Sansone l, Christoph Steinbeck a, Rodrigo Lopez a, Juan Antonio Vizcaíno a, Peipei Ping m, Henning Hermjakob a,n,*

a European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK.
b School of Bio-information, Chongqing University of Posts and Telecommunications, 400065 Chongqing, China.
c Department of Pathology, University of Michigan, Ann Arbor, Michigan, 48109, USA.
d Research School of Biology, Australian National University, Canberra, 0200, Australia.
e Department of Computer Science and Engineering, University of California, San Diego, 9500, La Jolla, California 92093, USA.
f Commonwealth Scientific and Industrial Research Organisation, Canberra, 0200, Australia.
g Institute for Systems Biology, Seattle, Washington, USA.
h Biochemistry & Medical Genetics, University of Manitoba, Winnipeg, R3T 2N2, Canada.
i Department of Bioengineering, UC San Diego, La Jolla, CA 92093-0412, USA.
j Department of Medicine, University of Cambridge, Cambridge, UK.
k ELIXIR Hub, Wellcome Genome Campus, Hinxton, Cambridge, UK.
l Oxford e-Research Centre, University of Oxford, 7 Keble Road, OX1 3QG, UK.
m Department of Physiology and Department of Medicine, Division of Cardiology, David Geffen School of Medicine at UCLA, 675 Charles E. Young Drive, MRL Building, Suite 1609, Los Angeles, California 90095, USA.
n National Center for Protein Sciences Beijing, No. 38, Life Science Park Road, Changping District, 102206 Beijing.

† These authors contributed equally to this work.
* Corresponding authors: Dr. Yasset Perez-Riverol and Henning Hermjakob. European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK.


Table of Contents

1. Omics Discovery Index: Metadata
1.1 Mandatory Fields
1.2 Recommended Fields
1.3 Additional Fields
2. OmicsDI XML schema
3. OmicsDI XML validator
3.1 Example of Database to OmicsDI XML
4. OmicsDI Software Architecture
4.1 Annotation and Normalization Component
4.2 Annotation Expansion Component
4.3 Indexing Component and search component
5. Similarity among datasets
5.1 Metadata Similarity
5.2 Biological Similarity
6. Web interface and web service
6.1 OmicsDI RESTful Web Service and API
6.2 OmicsDI Web Application
7. ddiR package
8. Resource Statistics
8.1 The number of datasets by Repository
8.2 Datasets with other related omics datasets
8.3 Statistics of reanalysis and reuse of datasets
10. References


1. Omics Discovery Index: Metadata

Eleven resources have agreed on a common metadata structure and exchange format and have contributed to OmicsDI, including (i) major proteomics databases: PRIDE, PeptideAtlas, MassIVE and GPMDB; (ii) four major metabolomics databases: MetaboLights, GNPS, the Metabolomics Workbench and Metabolome Express; (iii) major transcriptomics databases: ArrayExpress and Expression Atlas; and (iv) the EGA (European Genome-Phenome Archive), the major European archive for genomics and phenotypic data (Table 1).

Table 1: List of resources involved in the first implementation of OmicsDI metadata schema.

| Database/Resource | Data Type | URL | Number of Datasets (August 2016) | Update Frequency |
|---|---|---|---|---|
| PRIDE | Proteomics | http://www.ebi.ac.uk/pride/archive | 2,688 | Weekly |
| PeptideAtlas | Proteomics | http://www.peptideatlas.org | 2,365 | Monthly |
| MassIVE | Proteomics | http://massive.ucsd.edu | 361 | Monthly |
| GPMDB | Proteomics | http://gpmdb.thegpm.org | 367 | Monthly |
| MetaboLights | Metabolomics | http://www.ebi.ac.uk/metabolights | 176 | Weekly |
| GNPS | Metabolomics | http://gnps.ucsd.edu/ | 318 | Monthly |
| Metabolomics Workbench | Metabolomics | http://www.metabolomicsworkbench.org | 283 | Monthly |
| Metabolome Express | Metabolomics | https://www.metabolome-express.org/ | 58 | Monthly |
| EGA | Genomics | http://www.ebi.ac.uk/ega/ | 1,900 | Weekly |
| Expression Atlas | Transcriptomics | https://www.ebi.ac.uk/gxa/ | 2,913 | Weekly |
| ArrayExpress | Transcriptomics, Genomics, Metabolomics, Proteomics | https://www.ebi.ac.uk/arrayexpress/ | 66,913 | Weekly |

The amount and completeness of the technical and biological metadata associated with an omics dataset is a well-known issue in the biomedical community and in information science [7-10]. In contrast to scientific publications, where a common structure is provided to describe the results of a study, datasets are potentially more heterogeneous to describe. Different guidelines [10, 11], standards and protocols have been created to standardize the metadata that researchers need to provide when depositing their data and describing their experiments. However, many repositories provide only a subset of the metadata included in these guidelines, and the data is still valuable and accessible. To overcome this challenge, OmicsDI defines a hierarchical metadata schema for each dataset divided into three main categories: (i) mandatory, (ii) recommended, and (iii) additional fields. The scoring system in the OmicsDI search engine boosts those datasets that provide more metadata, rewarding the groups and researchers that annotate them more comprehensively. The recommended and additional fields increase the amount of metadata available for a dataset. All these fields are free-text based, because not all repositories, and especially not all individual laboratories, have the bioinformatics support needed to use controlled vocabularies (CVs) and ontologies. However, if the original repository stores CV/ontology identifiers, they can be used as cross-reference identifiers. The following tables describe the metadata fields required for each category, including examples (as XML snippets) and a short description for each case.

1.1 Mandatory Fields

Table 2: Mandatory fields in the OmicsDI XML schema.

| Name | OmicsDI XML schema field | Description | XML Snippets |
|---|---|---|---|
| Repository name | database name | Name of the original resource or repository | <database><name>PRIDE</name></database> |
| Repository identifier | entry id | Original id of the dataset in the repository | <entry id="PRD000123"> |
| Dataset Name | name | The name or title of the dataset | <name>Large scale qualitative and quantitative profiling of tyrosine phosphorylation using a combination</name> |
| Publication Date | publication date | Date of publication in the repository | <dates><date type="publication" value="2009-07-14"/></dates> |
| Submitter information | submitter information | The name of the submitter and the owner of the dataset | <field name="submitter">John Smith</field> |
| Original URL (Uniform Resource Locator) | full dataset link | The original URL provides a direct link to the repository where the original data files are stored | <field name="full_dataset_link">http://www.ebi.ac.uk/pride/archive/projects/PRD000123</field> |

Every dataset must have all this information to be included in OmicsDI. All of the mandatory fields are designed to be able to link each dataset with the original repository. This information is provided as free-text inside the XML schema. All of these fields also provide extra functionalities in OmicsDI. For example, the combination of the repository identifier and the repository name is the OmicsDI identifier, avoiding the creation of a new identifier. The existence of the mandatory fields allows the inclusion of a dataset in OmicsDI. However, if the repository only provides this limited information, it can be difficult for the users to find the dataset because of its limited metadata. This approach is similar to the one followed by PubMed for articles that do not contain an abstract (e.g. editorials, see www.ncbi.nlm.nih.gov/pubmed/24304322).
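The mandatory-field rule above can be illustrated with a minimal Python sketch. It is not part of the OmicsDI software (which is written in Java): the function names, the sample file, the second entry, and the `<entries>` wrapper are assumptions made for illustration. The document does not specify how the repository name and identifier are combined into the OmicsDI identifier, so the sketch simply keeps them as a pair.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample file. The <entries> wrapper is an assumption; the
# code below uses iter("entry"), so it does not depend on the nesting.
SAMPLE_XML = """
<database>
  <name>PRIDE</name>
  <entries>
    <entry id="PRD000123">
      <name>Large scale qualitative and quantitative profiling of tyrosine phosphorylation</name>
      <dates>
        <date type="publication" value="2009-07-14"/>
      </dates>
      <additional_fields>
        <field name="submitter">John Smith</field>
        <field name="full_dataset_link">http://www.ebi.ac.uk/pride/archive/projects/PRD000123</field>
      </additional_fields>
    </entry>
    <entry id="PRD000124">
      <name>Hypothetical entry without dates, submitter or link</name>
    </entry>
  </entries>
</database>
"""

# Mandatory items that Table 2 shows as <field name="..."> elements.
MANDATORY_FIELD_NAMES = ("submitter", "full_dataset_link")


def missing_mandatory(database_name, entry):
    """Return the names of the mandatory items that an <entry> lacks."""
    missing = []
    if not database_name:
        missing.append("database name")
    if not entry.get("id"):
        missing.append("entry id")
    if not (entry.findtext("name") or "").strip():
        missing.append("name")
    if entry.find("dates/date[@type='publication']") is None:
        missing.append("publication date")
    present = {field.get("name") for field in entry.iter("field")}
    missing.extend(name for name in MANDATORY_FIELD_NAMES if name not in present)
    return missing


def check_file(xml_text):
    """Map (repository name, repository identifier) to the missing items."""
    root = ET.fromstring(xml_text)
    database_name = (root.findtext("name") or "").strip()
    return {
        (database_name, entry.get("id")): missing_mandatory(database_name, entry)
        for entry in root.iter("entry")
    }


report = check_file(SAMPLE_XML)
for identifier, missing in report.items():
    print(identifier, "complete" if not missing else f"missing: {missing}")
```

An entry with an empty list could be included in OmicsDI under the rule above; any other entry would need to be completed by the data provider first.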

1.2 Recommended Fields

Table 3: Recommended fields in the OmicsDI XML schema.

| Name | OmicsDI XML schema field | Description | XML Snippets |
|---|---|---|---|
| Description/Abstract | description | A general description of the dataset. Similar in concept to a manuscript’s abstract. | <description>Description of the dataset</description> |
| Sample Protocol Description | sample_protocol | Short description of the sample, reagents, protocols, etc. | <field name="sample_protocol">Human cells culture for 20 min</field> |
| Data Protocol Description | data_protocol | Summary of the bioinformatics analysis, data processing steps, etc. applied to the dataset | <field name="data_protocol">The R package was used to get the expression values</field> |
| PubMed ID | pubmed | PubMed ID of the manuscript or manuscripts related to the dataset | <ref dbkey="19770167" dbname="pubmed"/> |
| Tissue | species, tissue, cell type and disease | Fields that can be provided as free text | <field name="species">Homo sapiens (Human)</field> <field name="cell_type">Epithelial cell line</field> <field name="disease">Carcinoma</field> <field name="tissue">HeLa cell</field> |
| Instrument | instrument | The instrument used to generate the data | <field name="instrument_platform">LTQ Orbitrap</field> |
| Omics Type | omics_type | This category allows OmicsDI to compare databases that contain similar information, and also to classify the databases according to this information. | <field name="omics_type">Proteomics</field> |

1.3 Additional Fields

Table 4: Additional fields in the OmicsDI XML schema.

| Name | OmicsDI XML schema field | Description | Omics Type | XML Snippets |
|---|---|---|---|---|
| Post-translational modifications | modification | Post-translational modifications associated with the dataset | Proteomics | <field name="modification">phosphorylated residue</field> |
| Quantitation Method | quantification_method | The analytical quantification method used in the experiment | Proteomics/Metabolomics | <field name="quantification_method">ITRAQ</field> |
| Taxonomy | taxonomy | The taxonomy identifier using the NCBI taxonomy. This property is stored as a cross-reference | All | <ref dbkey="9606" dbname="TAXONOMY"/> |
| Protein/Metabolite Identifier | id of the entry and the corresponding external database | Every dataset can contain a list of identified proteins or metabolites | All | <ref dbkey="Q8NBS9" dbname="uniprot"/> <ref dbkey="ENSP00000243108" dbname="ensembl"/> <ref dbkey="15555" dbname="ChEBI"/> |
| Dataset files URL | url of the files | URLs of all the files included in a dataset. | All | <field name="dataset_file">ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2010/07/PRD000123/PRIDE_Exp_Complete_Ac_9777.xml.gz</field> |
| Chromatographic Protocol | chromatographic_protocol | Some databases are able to provide more details about the analytical protocols, sample preparation, etc. | All | <field name="extraction_protocol">Four volumes of solvent (methanol:ethanol, 1:1 v/v) were added.</field> |

Additional fields can be defined as the remaining fields that can be used to describe the biological and technical metadata. They can be associated with particular omics types or be applicable to all types. The open XML schema of OmicsDI allows the original resources to add as much information as is available. Here, we have listed some of the most important and most used fields, but the list can be extended on request. More examples of how to provide the metadata and how to make use of it can be found in section 2.

2. OmicsDI XML schema

The OmicsDI XML is derived from the EBI Search XML Schema [12]. The corresponding XSD schema is available in GitHub (https://github.com/BD2K-DDI/specifications/blob/master/docs/schema/OmicsDISchema.xsd). The OmicsDI XML schema is open and is based on a key-value pair design, where each piece of information is represented by a field name and the corresponding value. For each repository or data provider a specific header should be provided:

    <database>
      <name>PRIDE</name>
      <description>Database of Proteomics Experiments</description>
      <release>May 2015</release>
      <release_date>2015-05-12</release_date>
      <entry_count>1</entry_count>

Each file should start with a database tag including the name of the database, the description (if it can be provided, it is a recommended field), the resource id, release date, and the number of entries in the file. Each repository can provide the data using one file per entry (one dataset) or by adding all the datasets to a single file. For all the files coming from the same repository the header should be the same. The entry (dataset) information contains the elements <id>, <name>, and <description>.

    <entry id="Dataset_ID">
      <name>Name of the Dataset</name>
      <description>Description of the dataset</description>

The following elements in the schema are optional: <dates> and <additional_fields>.

    <dates>
      <date type="publication" value="2013-11-26"/>
      <date type="submission" value="2013-11-19"/>
    </dates>
    <additional_fields>
      <field name="omics_type">Proteomics</field>
      <field name="data_protocol">Dataset analysed with Mascot</field>
      <field name="instrument_platform">LTQ Orbitrap</field>
      <field name="instrument_platform">OFFGEL Fractionator 3100</field>
    </additional_fields>

The <dates> element contains the date-related information: the mandatory publication date (see section 1), and also optional ones such as the update or submission dates (the example above includes a submission date). The data providers can include in this element as many dates as they store. The <additional_fields> element keeps all the recommended and additional fields, which are represented using key-value pairs. For example, the data protocol is defined using the data_protocol property. An updated list of all properties currently defined by the participating repositories can be found at: https://github.com/BD2K-DDI/specifications/blob/master/docs/schema/fileds.md.

    <cross_references>
      <ref dbkey="CHEBI:16551" dbname="ChEBI"/>
      <ref dbkey="MTBLC16551" dbname="MetaboLights"/>
      <ref dbkey="CHEBI:16810" dbname="ChEBI"/>
      <ref dbkey="MTBLC16810" dbname="MetaboLights"/>
      <ref dbkey="CHEBI:30031" dbname="ChEBI"/>
      <ref dbkey="9606" dbname="TAXONOMY"/>
      <ref dbkey="19770167" dbname="pubmed"/>
    </cross_references>

The <cross_references> element is used to represent those properties that can link to other data repositories, bioinformatics databases and literature resources mentioned in the OmicsDI XML file. For example, if the PubMed identifier is known, a cross-reference to PubMed should be added. A complete list of available databases for cross-references in EBI Search is available from http://www.ebi.ac.uk/ebisearch/. As a key point, the <cross_references> element also provides a well-defined structure for ontology and CV terms, biological entities, etc. For every database that provides its data through OmicsDI:

● A custom XML parser is already in place.
● No further coding is required on their side.
● Data providers can decide the additional fields they would like to be indexed.
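The schema elements described in this section can be assembled programmatically. The following Python sketch uses only the standard library and reuses the element and attribute names from the snippets above. The function name, the element order, and the `<entries>` wrapper are assumptions; the published XSD remains the authoritative definition, and the output of this sketch has not been checked against it.

```python
import xml.etree.ElementTree as ET


def build_omicsdi_xml(db_name, entry_id, title, description,
                      publication_date, fields, cross_refs):
    """Serialise one dataset as an OmicsDI-style XML string (illustrative)."""
    database = ET.Element("database")
    ET.SubElement(database, "name").text = db_name
    ET.SubElement(database, "entry_count").text = "1"

    # Assumption: entries are grouped inside an <entries> wrapper.
    entries = ET.SubElement(database, "entries")
    entry = ET.SubElement(entries, "entry", {"id": entry_id})
    ET.SubElement(entry, "name").text = title
    ET.SubElement(entry, "description").text = description

    dates = ET.SubElement(entry, "dates")
    ET.SubElement(dates, "date", {"type": "publication", "value": publication_date})

    # The same key may appear several times (e.g. two instruments),
    # so fields are passed as (name, value) pairs rather than a dict.
    additional = ET.SubElement(entry, "additional_fields")
    for name, value in fields:
        ET.SubElement(additional, "field", {"name": name}).text = value

    refs = ET.SubElement(entry, "cross_references")
    for dbname, dbkey in cross_refs:
        ET.SubElement(refs, "ref", {"dbkey": dbkey, "dbname": dbname})

    return ET.tostring(database, encoding="unicode")


xml_text = build_omicsdi_xml(
    "PRIDE", "Dataset_ID", "Name of the Dataset", "Description of the dataset",
    "2013-11-26",
    fields=[("omics_type", "Proteomics"),
            ("instrument_platform", "LTQ Orbitrap"),
            ("instrument_platform", "OFFGEL Fractionator 3100")],
    cross_refs=[("TAXONOMY", "9606"), ("pubmed", "19770167")],
)
print(xml_text)
```

Passing fields as pairs keeps the key-value design of the schema while allowing repeated keys, as in the two instrument_platform values of the example above.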

3. OmicsDI XML validator

The OmicsDI XML validator (https://github.com/BD2K-DDI/xml-validator) is a Java-based tool that can be used to validate OmicsDI XML files. The current library and corresponding command-line tool provide different options for checking the files according to the metadata they provide. The library is based on the rules and specifications of the OmicsDI project and on the OmicsDI XML schema (described in Sections 1 and 2). Validation can be done either for a single XML file or for a directory containing many of them.

    > java -jar validatorCLI.jar --help
    usage: validatorCLI
     -check <arg>              Choose validation level (default level is Warn):
                               Warn: This category do a complete Schema and semantic validation of the file.
                               Error: This category do a validation at level of XML Schema
     -inFile <arg>             Input file or Directory to be processed. If the value is a directory, the procedure will be applied to all files
     -merge <property=value>   Convert a given directory files to an out put with other options
     -reportFile <arg>         Record error/warn messages into outfile. If not set, print message on the screen.

The command-line tool allows data providers to validate their files at different levels before providing data to OmicsDI. It generates a final report listing the fields that are missing in the different datasets.

The OmicsDI XML validator also provides a Java library that enables the reading and writing of OmicsDI XML files. The command-line software and the Java library are freely available at https://github.com/BD2K-DDI/xml-validator.

3.1 Example of Database to OmicsDI XML

The GPMDB DDI Reader project (https://github.com/BD2K-DDI/gpmdb-ddi-reader) was developed to allow the integration of GPMDB model metadata into the OmicsDI infrastructure. The project was designed following the ETL principles (https://en.wikipedia.org/wiki/Extract,_transform,_load), where new data is constantly queried from the GPMDB ftp servers (http://gpmdb.thegpm.org/). If new projects are found, the respective files are downloaded and stored locally, cleaned, filtered and then stored in text format for indexing. The final XML files are created after all the desired information from all metadata files has been grouped together. This allows the model files, which are data-centered, to be reorganized into a project-centered architecture.
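The regrouping step of this ETL process, from model-centered records to project-centered ones, can be sketched in a few lines of Python. This is not the GPMDB DDI Reader code: the record layout, the key names (`model`, `project`, `species`) and all identifiers are hypothetical and only illustrate the transformation.

```python
from collections import defaultdict


def group_by_project(model_records):
    """Regroup model-centred records into one summary per project."""
    projects = defaultdict(lambda: {"models": [], "species": set()})
    for record in model_records:
        project = projects[record["project"]]
        project["models"].append(record["model"])
        # Cleaning step: normalise whitespace and skip empty values.
        species = (record.get("species") or "").strip()
        if species:
            project["species"].add(species)
    return dict(projects)


# Hypothetical extracted records, one per model file.
records = [
    {"model": "model-001", "project": "project-A", "species": "Homo sapiens"},
    {"model": "model-002", "project": "project-A", "species": " Homo sapiens "},
    {"model": "model-003", "project": "project-B", "species": ""},
]

projects = group_by_project(records)
for name, summary in projects.items():
    print(name, summary["models"], sorted(summary["species"]))
```

Each project summary could then be written out as one OmicsDI entry, which is the project-centered view described above.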


4. OmicsDI Software Architecture

All OmicsDI related software is written in Java, open source and available in GitHub (https://github.com/BD2K-DDI). The software architecture of OmicsDI has three main sections: (i) the annotation/enrichment components; (ii) the data integration pipeline, including indexing; and (iii) the OmicsDI web interface and web service for presenting and searching the datasets. The design of the OmicsDI architecture follows the component-based approach (https://en.wikipedia.org/wiki/Component-based_software_engineering), implementing a reuse-based approach to define, implement and compose loosely coupled independent components into systems (Figure 1). The following sections provide a detailed description of each OmicsDI component.

Figure 1: Overview of the OmicsDI components including: (i) the annotation component; (ii) the annotation expansion component, which adds new synonyms and metadata to each dataset; (iii) the indexing component, which inserts each dataset into a persistent MongoDB database and indexes it using the Lucene-based EBI Search system; (iv) the web service and the web interface that provide the data to the users.

4.1 Annotation and Normalization Component

Different repositories provide the metadata using different annotation standards, fields, identifiers, and levels of curation. The first step of the OmicsDI pipeline is the normalization of this metadata by using external web services and internal functions. The first normalization process is focused on four types of identifiers: (i) PubMed, (ii) proteins and metabolites, (iii) taxonomy, and (iv) ontology or CV terms.


Publication annotation: The metadata related to publications can be one of the most complex pieces of information to extract. When a repository does not store the PubMed identifier of a publication, several scenarios are possible. Some resources annotate the publications as citations (e.g. Audain E, Ramos Y, Hermjakob H, Flower DR, Perez-Riverol Y. Bioinformatics. 2015 Nov 14). Others only provide the title of the publication (e.g. Accurate estimation of Isoelectric Point of Protein and Peptide based on Amino Acid Sequences.). Using the PubMed web services and other tools, the PubMed Id is retrieved for each dataset (if it exists) and the information is annotated in the OmicsDI <cross_references> element.

    <ref dbkey="19770167" dbname="pubmed"/>

This annotation process allows other steps and components in OmicsDI to retrieve the full information from the PubMed web services and the EBI Search service [12]. The publication annotation component is also able to retrieve DOIs (Digital Object Identifiers) within the metadata and convert them to a PubMed identifier using a Java library (available at https://github.com/BD2K-DDI/ddi-annotation).

Protein and metabolite/small molecule identifiers: Biological databases use different identifiers for representing the same biological entity. For example, in proteomics, several sequence databases can be used to perform the protein identification analysis, such as UniProt, Ensembl, etc. In this case, the UniProt identifier mapping service is used (http://www.uniprot.org/help/uploadlists). For metabolites/small molecules, the identifier normalization component uses different services such as the PubChem Identifier Exchange Service (https://pubchem.ncbi.nlm.nih.gov/idexchange/idexchange-help.html) to convert PubChem identifiers to ChEBI identifiers (https://www.ebi.ac.uk/chebi/). The standardization of the identifiers enables the comparison of biological entities present across multiple datasets.

CV and ontology based identifiers: If the resource uses ontology or CV terms, the annotation component uses the Ontology Lookup Service (OLS) to annotate the proper name of the term in the OmicsDI XML file [13]. In addition, the NCBI Taxonomy Service (http://www.ncbi.nlm.nih.gov/books/NBK25499/) is used to convert annotations from taxonomy identifiers to free-text and from free-text to taxonomy identifiers.

4.2 Annotation Expansion Component

The annotation expansion component aims to increase the metadata available for a particular dataset. For example, “Human” datasets can be referenced as “Human” in some cases and as “Homo sapiens” in others, making it difficult for the indexing systems to find the datasets if the exact words are not provided in the search input [14]. This is a well-known problem in the semantic community that has been tackled by many groups [14, 15]. The OmicsDI annotation expansion pipeline uses two components to detect the semantically meaningful sentences in the metadata, enriching it with synonyms based on ontology or CV terms. This Java-based component uses Application Programming Interfaces (APIs) and web services from NCBO (National Center for Biomedical Ontology, http://www.bioontology.org/): ‘Recommender’ and ‘Annotator’ [17]. In summary, every dataset field is enriched using these two services.

The Recommender takes as input a text or a list of keywords, and suggests appropriate ontologies or CVs for them. The ontology-ranking algorithm used by the Recommender evaluates how well each ontology matches the input, using a combination of four evaluation criteria [16]. The Annotator matches words in the text to terms in ontologies or CVs (the system selects the ontologies/CVs depending on the omics_type), by doing an exact string comparison (a “direct” match) between the text and ontology/CV term names, synonyms, and identifiers. In addition to the direct matches, the user may expand the set of matches by including others from mapped terms and from hierarchical expansion.

For example, the title of the PRIDE dataset PRD000043 is:

“Non-ionic Detergent Phase Extraction for the Quantitative Proteomic Analysis of Heart Membranes Proteins using Label-Free LC-MS”

After the enrichment component runs, the same title is stored together with a set of synonyms for every relevant word. For example, the list for Detergent includes: Agents, Synthetic Detergent or Cleansing Agent.

Some of the synonyms can be non-meaningful in the context of the dataset. However, if the user searches using one of these terms, the scoring model in the indexing step (section 4.3) will always show first the datasets that contain the term as originally annotated in the text, and then the datasets that contain the term only as an annotated synonym.
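The ranking behaviour described above, where original text outranks expanded synonyms, can be illustrated with a toy Python sketch. It does not call the NCBO services and is not the OmicsDI scoring model: the synonym table is hard-coded (only the Detergent synonyms come from the text above), the two weights are arbitrary assumptions, the second title is hypothetical, and matching is a simple case-insensitive substring test.

```python
# Toy synonym table; in OmicsDI the synonyms come from the NCBO services.
SYNONYMS = {
    "detergent": ["Agents", "Synthetic Detergent", "Cleansing Agent"],
}

# Assumed weights: a hit in the original text outranks a synonym hit.
ORIGINAL_WEIGHT = 2.0
SYNONYM_WEIGHT = 1.0


def expand(text, synonyms=SYNONYMS):
    """Return the synonyms of every dictionary term found in the text."""
    lowered = text.lower()
    added = []
    for term, alternatives in synonyms.items():
        if term in lowered:
            added.extend(alternatives)
    return added


def score(query, text, synonyms=SYNONYMS):
    """Score a dataset title for a query: original hit > synonym hit > none."""
    query = query.lower()
    if query in text.lower():
        return ORIGINAL_WEIGHT
    if any(query in synonym.lower() for synonym in expand(text, synonyms)):
        return SYNONYM_WEIGHT
    return 0.0


title_prd000043 = ("Non-ionic Detergent Phase Extraction for the Quantitative "
                   "Proteomic Analysis of Heart Membranes Proteins using "
                   "Label-Free LC-MS")
# Hypothetical title that contains the query term in its original text.
title_hypothetical = "Effect of a Cleansing Agent on membrane protein recovery"

titles = [title_prd000043, title_hypothetical]
ranked = sorted(titles, key=lambda t: score("cleansing agent", t), reverse=True)
print(ranked[0])
```

With these weights, a search for "cleansing agent" returns both titles, but the one that contains the phrase in its original text is listed first.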

4.3 Indexing Component and search component

OmicsDI uses the indexing and search infrastructure provided by the EBI Search engine. The EBI Search is based on Apache Lucene (https://lucene.apache.org/) and consists of (Figure 2): (i) the indexing system, (ii) the indexed data, and (iii) the search engine. Keeping OmicsDI up to date requires a system that automatically updates and re-indexes data at the frequencies listed in Table 1. When data updates are detected (e.g. datasets are added or updated), the data are downloaded and optionally decompressed into an off-line directory18. The indexing system then generates indexes from the new data by extracting the relevant information. The indexing task is farmed out to a number of machines (EBI cluster), each of which creates partial indexes. Once all the tasks are completed, the partial indexes are merged into the final indexed data.
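The split-then-merge idea can be illustrated with a toy inverted index. This is only a sketch under stated assumptions: the function names are hypothetical, the document texts are invented for illustration, and Lucene's actual segment format and merge policy are far more involved.

```python
from collections import defaultdict


def build_partial_index(documents):
    """Build an inverted index (term -> set of document ids) for one batch."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def merge_indexes(partials):
    """Merge partial inverted indexes into a single index."""
    merged = defaultdict(set)
    for partial in partials:
        for term, doc_ids in partial.items():
            merged[term] |= doc_ids
    return dict(merged)


# Two batches, as if each had been indexed on a different machine.
batch_a = {"PRD000043": "heart membrane proteomics"}
batch_b = {"MTBLS169": "arabidopsis metabolite profiling",
           "PRD000269": "aorta proteomics"}
final_index = merge_indexes([build_partial_index(batch_a),
                             build_partial_index(batch_b)])
```

After the merge, a term that occurs in both batches (here "proteomics") points to the documents from both partial indexes.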


Figure 2: Components of the EBI Search engine (see https://www.ebi.ac.uk/ebisearch/) including the indexing engine, the indexed data, and the search functionality available through a RESTful API.

The EBI Search engine builds multiple index segments and merges them when required19. For each new document indexed, new index segments are created and merged with existing ones as the result of an optimization process that keeps the total number of segments low so that searches remain fast. The EBI Search engine provides the capability to query multiple indexes at the same time. For instance, when users search for specific publications, the PubMed index is used.

Query Language: OmicsDI uses the Apache Lucene query syntax19, which is similar to that used by Google and other major Internet search engines. Table 5 describes the major syntactical elements supported (a detailed description can be found at http://www.ebi.ac.uk/ebisearch/documentation.ebi).

Table 5: Main syntactical elements of the Lucene library used in OmicsDI.

Element | Meaning | Usage | Example | Notes
AND | In addition to | term1 AND term2 | glutathione AND transferase | Matches entries where both glutathione and transferase occur.
OR | Equivalence | term1 OR term2 | glutathione OR transferase | Matches entries where either glutathione or transferase occurs.
NOT | Exclusion | term1 NOT term2 | coding NOT fragment | Matches entries containing coding but not fragment.
* | Wildcard | partialTerm* | gluta* | Matches, for instance, glutathione, glutamate, glutamic, etc.
" " | Exact match | "quoted text" | "x-ray diffraction" | Exact matching for entries containing x-ray diffraction.
( ) | Grouping | (text) | (reductase OR transferase) AND glutathione |
Field: | Field-specific search | fieldId:term | description:dopamine | Matches for a field description containing dopamine.
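The elements of Table 5 can be combined into longer queries. The small helpers below are hypothetical (they are not part of OmicsDI or Lucene) and only assemble query strings that follow the syntax listed in the table.

```python
def quote(phrase):
    """Wrap a phrase in double quotes for exact matching."""
    return '"' + phrase + '"'


def field(field_id, term):
    """Restrict a term to one field, e.g. description:dopamine."""
    return f"{field_id}:{term}"


def all_of(*terms):
    """Join terms with AND."""
    return " AND ".join(terms)


def any_of(*terms):
    """Join terms with OR and group the result in parentheses."""
    return "(" + " OR ".join(terms) + ")"


query = all_of(any_of("reductase", "transferase"), "glutathione")
field_query = all_of(field("description", "dopamine"), quote("x-ray diffraction"))
```

`query` reproduces the grouping example from the table, `(reductase OR transferase) AND glutathione`, and `field_query` combines a field-specific term with an exact phrase.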

5. Similarity among datasets

We formalize the related document search problem as follows: given a document that the user is interested in, the task is to retrieve other documents that the user may also want to examine. We think of the problem in broad terms: other documents may be interesting because they discuss similar topics, share the same literature citations, provide similar general background, lead to interesting hypotheses, etc. In fact, finding the correct dataset can be a challenging task when the number of datasets in the repository exceeds the capacity of the user to browse them one by one20. Since the creation of the first general-purpose search engines such as Google and Yahoo, different metrics have been created to facilitate search and navigation. For example, PubMed provides a “similar articles box” that enables users to find and review other manuscripts similar to the one they are interested in (Figure 3). This is possible because the service uses the citations shared between manuscripts to link them21.

Figure 3: Screenshot showing the “Similar articles” functionality in PubMed.

The OmicsDI platform provides two different similarity metrics for the users: (i) the metadata similarity score, and (ii) the biological similarity score. Both metrics provide a simple way of retrieving all the datasets similar to the original dataset of interest.


5.1 Metadata Similarity

The metadata similarity score is computed from all the metadata fields in the dataset (including title, description, sample and data protocols, species, and instruments). In summary, the algorithm uses the following steps:

● It computes the term frequencies (TF) by tokenizing the metadata text provided for every dataset. The algorithm computes the frequency of every term in the document, and it can usually find a small set of terms that characterize the document.

● Next, it calculates the Inverse Document Frequency (IDF)22, which is a score factor based on a term’s document frequency (the number of documents that contain the term). Terms that occur in fewer documents are better indicators of topic. Therefore, implementations of this method usually return larger values for rare terms, and smaller values for common terms.

● Finally, a similarity score is calculated as the product of the IDF score and the number of times the term occurs in the source document.

The similarity score is provided for every dataset and the corresponding documents in the index. The TF-IDF value increases proportionally to the number of times a given word appears in the metadata of a dataset, but is offset by the frequency of that word in the OmicsDI index, which helps to adjust for the fact that some words appear more frequently in general. Figure 4 shows the distribution of metadata similarity scores for every omics type, including the median of the score values. This median value is used to filter out the very low similarity scores: all the datasets with a score below the median of the distribution are removed. The overall distributions show that genomics datasets are more correlated in terms of metadata than, for example, proteomics datasets. For this reason, the algorithm applies the similarity score filter separately for each omics type.
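The three steps and the median filter can be sketched as follows. This is a simplified illustration, not the OmicsDI implementation: the IDF variant log(N / df), the whitespace tokenizer, the function names and the toy corpus are all assumptions, and the median is taken here over the candidates of one source document rather than over the whole distribution of an omics type.

```python
import math
from collections import Counter
from statistics import median


def tokenize(text):
    return text.lower().split()


def idf_table(corpus):
    """IDF per term; log(N / df) is one common variant, assumed here."""
    n_docs = len(corpus)
    df = Counter(term for text in corpus.values() for term in set(tokenize(text)))
    return {term: math.log(n_docs / count) for term, count in df.items()}


def similarity_scores(source_id, corpus):
    """Score every other document by the TF-IDF weight of the terms it shares."""
    idf = idf_table(corpus)
    tf = Counter(tokenize(corpus[source_id]))  # term frequencies in the source
    scores = {}
    for doc_id, text in corpus.items():
        if doc_id == source_id:
            continue
        shared = set(tokenize(text)) & set(tf)
        scores[doc_id] = sum(tf[term] * idf[term] for term in shared)
    return scores


def filter_by_median(scores):
    """Drop the documents that score below the median."""
    cutoff = median(scores.values())
    return {doc_id: s for doc_id, s in scores.items() if s >= cutoff}


# Toy corpus with invented metadata text.
corpus = {
    "A": "arabidopsis metabolite profiling",
    "B": "arabidopsis metabolite leaf",
    "C": "human plasma metabolite",
    "D": "mouse liver proteomics",
}
scores = similarity_scores("A", corpus)
kept = filter_by_median(scores)
```

In this toy corpus, document B shares the rarer term "arabidopsis" with A and therefore scores higher than C, which shares only the common term "metabolite"; D shares nothing and is removed by the median filter.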

Figure 4: The distribution of metadata similarity score for the different omics types: genomics, proteomics and metabolomics. The vertical lines represent the median of each distribution at values: 6.5, 3.1, and 3.8, respectively.


The ‘related datasets by metadata’ functionality provides the list of the most similar datasets using the metadata provided by the repositories such as title, description, species, instrument or identifier. For example, the dataset MTBLS169 (entitled Metabolite Profiling of wildtype and overexpression Arabidopsis thaliana) has five similar studies in OmicsDI (Figure 5).

Figure 5: MTBLS169 dataset view including the list of the datasets similar by metadata (http://www.omicsdi.org/dataset/metabolights_dataset/MTBLS169).

5.2 Biological Similarity

The biological similarity score is based on the biological information provided in each dataset, at present protein, metabolite/small molecule and transcript/gene identifiers. Every dataset provides a set of standard identifiers that, similarly to the terms in the metadata, can be analysed using a mathematical model based on term frequencies and IDF22. For the ‘dataset biological similarity’ functionality, we developed the notion of a dataset biological vector that captures the relative importance of the terms (identifiers) in a dataset. This representation of a set of documents as vectors in a common vector space is known as the vector space model and is fundamental to a host of information retrieval operations, ranging from scoring documents on a query to document classification and document clustering23, 24. The set of datasets in OmicsDI may be viewed as a set of vectors in a vector space, in which there is one axis for each term. To quantify the similarity between two documents, we first consider the magnitude of the vector difference between two dataset vectors25. To compensate for the effect of the document length, one of the standard ways of quantifying the similarity between two documents d1 and d2 is to compute the cosine similarity of their vector representations V(d1) and V(d2):

$$\cos(d_1, d_2) = \frac{V(d_1) \cdot V(d_2)}{|V(d_1)| \times |V(d_2)|} \tag{1}$$

where the numerator represents the dot product (also known as the inner product) of the vectors $V(d_1)$ and $V(d_2)$, while the denominator is the product of their Euclidean lengths. We can re-write equation 1 as:

$$\cos(\theta) = \frac{\sum_{i=1}^{n} V_i(d_1)\, V_i(d_2)}{\sqrt{\sum_{i=1}^{n} V_i(d_1)^2} \times \sqrt{\sum_{i=1}^{n} V_i(d_2)^2}} \tag{2}$$

where $V_i(d)$ is the $i$-th component of $V(d)$ and $n$ is the number of terms (axes) in the vector space.

For each dataset the cosine similarity is estimated against all the datasets in the database during the metadata enrichment process. All the scores are stored in a MongoDB database. If a dataset changes (due to an update in the repository), the similarity score is estimated again.
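A minimal sketch of the cosine similarity between two dataset vectors is given below. The vectors are stored as sparse dictionaries keyed by identifier; the identifiers and the unit weights are arbitrary illustrative values, not taken from any OmicsDI dataset, and the function name is hypothetical.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    # Dot product: only identifiers present in both vectors contribute.
    dot = sum(weight * vec_b.get(term, 0.0) for term, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Illustrative vectors: identifier -> weight (e.g. a TF-IDF value).
d1 = {"P02768": 1.0, "P68871": 1.0}
d2 = {"P02768": 1.0, "Q9HAU5": 1.0}
score = cosine_similarity(d1, d2)
```

With one shared identifier out of two in each vector, the score is approximately 0.5; two identical vectors give approximately 1.0, and vectors with no shared identifier give 0.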

Figure 6: The distribution of shared molecules (biological similarity score) for proteomics and metabolomics. The vertical lines represent the median of each distribution at values: 0.23 and 0.21, respectively.

Figure 6 shows the distributions of similarity scores for proteomics and metabolomics. The distributions show that more than 1000 and 195 datasets have a biological similarity score above 0.5 for proteomics and metabolomics, respectively. For example, the PRIDE dataset PRD000269 (http://www.ebi.ac.uk/pride/archive/projects/PRD000269) has twenty similar datasets with a score above 0.5 (http://www.omicsdi.org/dataset/pride/PRD000269). The five top related datasets (biological similarity above 0.80) are proteomics studies in aorta tissues.


6. Web interface and web service

The design of the web interface and web service follows the “Layered Architecture Pattern” (https://en.wikipedia.org/wiki/Multilayered_architecture), where each layer has only one main responsibility and can only access the layer below. A MongoDB database (https://www.mongodb.com) is used to store the synonym terms, the similarity scores, and the access statistics. The Lucene Indexer System (http://lucene.apache.org) is used for storing and enabling the search of datasets and related information. The database library (https://github.com/BD2K-DDI/ddi-service-db) is used to access and update the data in the MongoDB database, and is used in the web service and in all the components of the submission pipeline.

6.1 OmicsDI RESTful Web Service and API

The OmicsDI RESTful (Representational State Transfer) web services (http://www.omicsdi.org/ws/) are implemented in Java, using the Spring framework (http://projects.spring.io/spring-framework/). Data can be accessed over HTTP (HyperText Transfer Protocol) via REST-like GET requests, which ensures that the services are easy to use and are supported by all major platforms. JSON (JavaScript Object Notation) was chosen as the output format since it is widely used as a data serialization format. The OmicsDI Web Service API is split into several specific resources, which currently are dataset, enrichment, term, statistics, and publication. This separation is also reflected in the service URLs, where the first level after the web service root determines the resource or data type. Data retrieval options depend on the information available at each level.

The Dataset entry point (which corresponds to an individual dataset in OmicsDI) provides the information about each dataset including the title, description, identifier, and database. The search method provides a general way of searching datasets in the resource by querying their associated metadata. The search capabilities should be used with pagination to enable more efficient access to the data. Methods that make use of paging have a corresponding count method, so users can check the total number of results before deciding whether paging is necessary and, if so, how it should be done. Users can then combine web service functionalities to achieve more complex queries, for example search for all the datasets that match a specific query and then retrieve all the similar datasets for each dataset using the getSimilar method (http://www.omicsdi.org/ws/dataset/getSimilar). The detailed list of the current methods of the OmicsDI web service is available in Table 6.

Table 6: List of the OmicsDI web service methods.

Name | Description
/dataset/latest | Retrieve the latest datasets in the repository
/dataset/getSimilar | Retrieve the datasets related to one dataset
/dataset/get | Retrieve a specific dataset
/dataset/search | Search for datasets in the resource
/dataset/getFileLink | Retrieve all file links for a given dataset
/dataset/mostAccessed | Retrieve the most accessed datasets
/enrichment/getSimilarDatasetsByExpData | Get similar datasets using the metadata for a dataset
/enrichment/getEnrichmentInfo | Get enrichment, synonyms of each field
/enrichment/getSimilarityInfo | Get biological similarity information for a dataset
/enrichment/getSynonymsForDataset | Get synonyms for a dataset
/statistics/tissues | Return statistics about the number of datasets per tissue
/statistics/general | Return general statistics about the service
/statistics/organisms | Return statistics about the number of datasets per species
/statistics/diseases | Return statistics about the number of datasets per disease
/statistics/omicsByYear | Return statistics about the number of datasets by omics type and year
/statistics/domains | Return statistics about the number of datasets per omics type
/statistics/omics | Return statistics about the number of datasets per repository
/term/getTermByPattern | Search dictionary terms
/term/frequentlyTerm/list | Retrieve frequent terms from the repository
/publication/list | Retrieve a set of publications by PubMed Identifier
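Paged access to the search method can be sketched as below. The web service root and the /dataset/search path come from the text and Table 6; the query parameter names (`query`, `start`, `size`) are an assumption that mirrors the arguments of the ddiR client, and the helper names are hypothetical. The sketch only builds URLs and performs no network request.

```python
from urllib.parse import urlencode

BASE_URL = "http://www.omicsdi.org/ws"  # web service root given in the text


def search_url(query, start, size):
    """Build a /dataset/search URL; the parameter names are assumptions."""
    params = urlencode({"query": query, "start": start, "size": size})
    return f"{BASE_URL}/dataset/search?{params}"


def page_starts(total_count, page_size):
    """Offsets of the pages needed to cover total_count results."""
    return list(range(0, total_count, page_size))


# Suppose a count call reported 250 matching datasets (made-up number).
urls = [search_url("glutathione", start, 100) for start in page_starts(250, 100)]
```

Knowing the total count first lets a client decide how many pages to request: 250 results with a page size of 100 give three requests, starting at offsets 0, 100 and 200.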

6.2 OmicsDI Web Application


The OmicsDI web application (http://www.omicsdi.org) provides access to all tools and resources of the OmicsDI ecosystem including the API, API clients, the code and the data. The OmicsDI home page provides an efficient entry point for OmicsDI using different “boxes”. Every box provides a way to navigate the data, but also provides general statistics of the datasets and the resource (Figure 7). Each box uses a different visualization layout to present the information to the users, enhancing the visualization and the interaction between the resource and the users26. All the visualization components are interactive and enable the user to navigate within OmicsDI, browsing particular datasets.

The “wordcloud box” can be seen as an overview of the most relevant terms in the different datasets. By clicking on a particular term or word, the service will search for all the datasets that contain it (Figure 7a). To generate the wordcloud, a post-processing step is run in the database and all the English “stop words” are removed. Additionally, words that are not relevant in the general context (e.g. proteomics, genomics or metabolomics) are removed.

The “biological box” provides quick access to relevant biological metadata of the datasets such as tissues, species, and diseases (Figure 7b). The bubble plot provides information about how many datasets have been annotated for each particular category. The size of the bubble is proportional to the number of datasets for each particular value. In the current version of OmicsDI most of the datasets are annotated using the general tissue term “cell culture”, and the most represented species is “Human”.
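The wordcloud post-processing can be sketched in a few lines. The stop-word and context-word lists below are tiny illustrative samples, the descriptions are invented, and the function name is hypothetical; the actual step runs in the database and is not described in more detail in the text.

```python
from collections import Counter

# Tiny illustrative lists; a real run would use a full English stop-word list.
STOP_WORDS = {"the", "of", "and", "in", "a", "for", "using"}
CONTEXT_WORDS = {"proteomics", "genomics", "metabolomics"}


def wordcloud_terms(descriptions, top_n=10):
    """Count terms across descriptions, skipping stop and context words."""
    counts = Counter()
    for text in descriptions:
        for word in text.lower().split():
            if word not in STOP_WORDS and word not in CONTEXT_WORDS:
                counts[word] += 1
    return counts.most_common(top_n)


descriptions = [
    "Proteomics of the human heart",
    "Metabolomics of human plasma",
    "Genomics of the mouse heart",
]
terms = wordcloud_terms(descriptions, top_n=2)
```

In this example the two most frequent remaining terms are "human" and "heart", each counted twice, while "proteomics", "genomics" and "metabolomics" never reach the cloud.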

Figure 7: Home webpage of OmicsDI including the different browsing boxes.

The “repository box” provides an overview of the number of datasets per repository and omics type (Figure 7c). Using an interactive bar chart, users can quickly get an overview of the number of datasets in each repository. It also presents the number of datasets per field. The number of datasets for each category and repository is shown as a label on each bar. These numbers can be programmatically accessed using the RESTful API methods /statistics/domains and /statistics/omics (Table 6).

The “latest datasets box” provides a short summary of the ten datasets most recently added to OmicsDI (Figure 7d). Using this box, OmicsDI shows an up-to-date overview of the new datasets available. It also provides a simple way of following the resources, encouraging end-users to visit OmicsDI regularly. Because the latest datasets are computed and retrieved for OmicsDI as a whole, via this box users can also come across resources outside their usual area of interest.

The “most accessed datasets box” aims to provide a metric of data access and relevance of the datasets. The biomedical informatics community does not yet have a standard way to provide metrics for datasets. OmicsDI traces every single access to each dataset, and the box shows the 20 most accessed datasets with the corresponding numbers of accesses. In addition to the citation numbers for the corresponding scientific paper, these figures can be used by the community to highlight the relevance of a dataset in a specific field. Finally, the “datasets per year box” presents the number of datasets per year, per omics type (Figure 7f).

Apart from the different ‘boxes’, additional functionality of the OmicsDI web interface is described below.

(i) Browsing Page: At present it is also possible to browse all datasets, or to search for them using different criteria. Datasets can be searched (http://www.omicsdi.org/search?q=”:”) and filtered based on different pieces of metadata information such as title, description attributes (e.g. species, tissues, diseases, etc), instrumentation, detected protein modifications, protein and metabolite/small molecule identifiers, and publication-related information (Figure 8).

Figure 8: Search page and dataset results for a query with the UniProt protein identifier Q9HAU5. All datasets that contain this specific protein are shown.


(ii) Facets help to narrow down the results: The available facets (filters) of a search result are presented on the left-hand side of the “browse” page (Figure 9). Different facets are provided to filter different aspects of the metadata. The text descriptions are followed by check boxes or filtering links, for selecting results according to specific attributes such as species, publication date, repository, tissue, disease, instrument, platform, and protein modifications (the latter only for proteomics datasets). As an example, if search results are filtered using the human value of the ‘Organisms’ facet and the ‘PRIDE’ repository, the browse page will show only the human datasets available in PRIDE Archive. The result page of a search can be filtered further based on most of the same criteria, tailoring the final result to the needs of the users.
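Faceted filtering of a result list can be sketched as follows. The records, their identifiers and the field names are hypothetical, and the two helper functions are illustrative; in OmicsDI the facets are computed by the search engine rather than in client code.

```python
from collections import Counter


def facet_counts(datasets, facet):
    """Number of datasets per value of one facet (e.g. 'repository')."""
    return Counter(d[facet] for d in datasets if facet in d)


def apply_facets(datasets, selected):
    """Keep datasets whose metadata matches every selected facet value."""
    return [d for d in datasets
            if all(d.get(facet) == value for facet, value in selected.items())]


# Hypothetical search results.
results = [
    {"id": "DS1", "organism": "Homo sapiens", "repository": "PRIDE"},
    {"id": "DS2", "organism": "Arabidopsis thaliana", "repository": "MetaboLights"},
    {"id": "DS3", "organism": "Homo sapiens", "repository": "EGA"},
]
human_pride = apply_facets(results, {"organism": "Homo sapiens",
                                     "repository": "PRIDE"})
per_organism = facet_counts(results, "organism")
```

Selecting the human organism and the PRIDE repository leaves only DS1, which corresponds to the example in the text; `per_organism` gives the counts that a facet panel would display next to each value.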

Figure 9: Facets defined to filter the datasets, to narrow down the results of the searches. Every filter can be removed and added by clicking the corresponding option in the facet.

(iii) Dataset View: Each dataset has a central page containing a summary of the general metadata (Figure 10), which includes the dataset title, description, species, and instrument, among others. Crucially, it also contains the link to the original dataset in the source repository. In addition, it contains a list of all the publications related to the dataset (Figure 10b). For each publication, the “publication box” shows the title, author list and abstract of the publication, and also the ‘altmetrics’ score.


Figure 10: The ‘Dataset view’ page shows the information of a particular dataset, and its related publications. The “dataset box” (a) shows the abstract (description) of the dataset, the sample and data protocols, species, tissues, instruments, the publication date and the original URL of the repository. The “publications box” presents all the publications related to the dataset. Additionally, it shows the ‘altmetrics’ score.

Datasets Similarity Views: The presentation of similar datasets is provided in two different visualization components: (i) similar datasets view (Figure 11), and (ii) shared molecules view (Figure 12).


Figure 11: Similar datasets to the genomics dataset EGAS00001000805. The right column shows all the datasets containing similar experimental metadata.

The “related datasets by metadata” functionality is shown on the right side of the figure and can be expanded by using the button “Load more” (Figure 11). All these related datasets have a similarity score above the threshold defined in section 5.1. The “shared molecules box” shows the datasets that share a significant number of molecules (proteins, metabolites/small molecules or genes at present) with the dataset under study. The chord plot (Figure 12) shows all the datasets having a similarity score above 0.5. The visualisation is dynamic: users can increase this threshold up to a score of 1.0, which is given to all datasets that share all the molecules (Figure 12).


Figure 12: Chord diagram of the projects that share molecules with PRIDE dataset PRD000269 (entitled Aortic extracellular space components). The chord diagram shows the accession numbers of the datasets that share molecules with the dataset, including the biological score.


7. ddiR package

ddiR (https://github.com/BD2K-DDI/ddiR) is an open-source R package that can be used to retrieve information from OmicsDI using the OmicsDI RESTful web services (Section 6). ddiR can retrieve all the information about each dataset and can also perform queries and searches in the resource. The following code pages through all datasets and, for each one, prints one line per metadata-similar dataset, including its similarity score and rank:

    library(ddiR)

    # First request: used only to obtain the total number of datasets
    datasets <- search.DatasetsSummary(query = "*:*")

    # Redirect the printed output to a file
    sink("outfile.txt")
    for (datasetCount in seq(from = 0, to = datasets@count, by = 100)) {
        # Retrieve one page of 100 dataset summaries
        datasets <- search.DatasetsSummary(query = "*:*", start = datasetCount, size = 100)
        for (dataset in datasets@datasets) {
            DatasetDetail = get.DatasetDetail([email protected], database = dataset@database)
            # Datasets similar by metadata to the current one
            Similar = get.MetadataSimilars(accession = [email protected], database = dataset@database)
            rank = 0
            for (similarDataset in Similar@datasets) {
                print(paste([email protected], [email protected], similarDataset@score, [email protected], rank))
                rank = rank + 1
            }
        }
    }
    # Restore normal output
    sink()


8. Resource Statistics

8.1 The number of datasets by Repository

Figure 13: Number of datasets by Repository (August 2016). The statistics were generated using the ddiR package.

8.2 Datasets with other related omics datasets

Table 7: Datasets with other related omics datasets in a different resource. This number is generated by cross-referencing datasets annotated as resulting from the same publication.

Repository | Other omics
ArrayExpress | 503
PRIDE | 60
EGA | 43
MetabolomeExpress | 12
MetaboLights | 10


8.3 Statistics of reanalysis and reuse of datasets

Table 8: Number of datasets per repository including the relation ‘Reanalyzed by’ and ‘Reanalysis of’.

Repository | Reanalyzed by other repositories
ArrayExpress | 2913
PRIDE | 293
MassIVE | 15

Repository | Reanalysis of other datasets
Expression Atlas | 2913
Peptide Atlas | 353
GPMDB | 282


10. References

1. Vizcaino, J.A. et al. 2016 update of the PRIDE database and its related tools. Nucleic acids research 44, D447-456 (2016).

2. Deutsch, E.W., Lam, H. & Aebersold, R. PeptideAtlas: a resource for target selection for emerging targeted proteomics workflows. EMBO reports 9, 429-434 (2008).

3. Craig, R., Cortens, J.P. & Beavis, R.C. Open source system for analyzing, validating, and storing protein identification data. Journal of proteome research 3, 1234-1242 (2004).

4. Haug, K. et al. MetaboLights--an open-access general-purpose repository for metabolomics studies and associated meta-data. Nucleic acids research 41, D781-786 (2013).

5. Sud, M. et al. Metabolomics Workbench: An international repository for metabolomics data and metadata, metabolite standards, protocols, tutorials and training, and analysis tools. Nucleic acids research 44, D463-470 (2016).

6. Lappalainen, I. et al. The European Genome-phenome Archive of human data consented for biomedical research. Nature genetics 47, 692-695 (2015).

7. Han, D. et al. Trends in biomedical informatics: automated topic analysis of JAMIA articles. J Am Med Inform Assoc 22, 1153-1163 (2015).

8. Chervitz, S.A. et al. Data standards for Omics data: the basis of data sharing and reuse. Methods in molecular biology 719, 31-69 (2011).

9. Orchard, S. et al. The minimum information required for reporting a molecular interaction experiment (MIMIx). Nature biotechnology 25, 894-898 (2007).

10. Taylor, C.F. et al. The minimum information about a proteomics experiment (MIAPE). Nature biotechnology 25, 887-893 (2007).

11. Brazma, A. Minimum Information About a Microarray Experiment (MIAME)--successes, failures, challenges. ScientificWorldJournal 9, 420-423 (2009).

12. Squizzato, S. et al. The EBI Search engine: providing search and retrieval functionality for biological data from EMBL-EBI. Nucleic acids research 43, W585-588 (2015).

13. Cote, R.G., Jones, P., Martens, L., Apweiler, R. & Hermjakob, H. The Ontology Lookup Service: more data and better tools for controlled vocabulary queries. Nucleic acids research 36, W372-376 (2008).

14. Rassinoux, A.M. Knowledge representation and management: benefits and challenges of the semantic web for the fields of KRM and NLP. Yearb Med Inform 6, 121-124 (2011).

15. White, P. & Roudsari, A. An ontology for healthcare quality indicators: challenges for semantic interoperability. Studies in health technology and informatics 210, 414-418 (2015).

16. Jonquet, C., Musen, M.A. & Shah, N.H. Building a biomedical ontology recommender web service. J Biomed Semantics 1 Suppl 1, S1 (2010).

17. Jonquet, C., Shah, N.H. & Musen, M.A. The open biomedical annotator. Summit on Translat Bioinforma 2009, 56-60 (2009).

18. Valentin, F. et al. Fast and efficient searching of biological data resources--using EB-eye. Brief Bioinform 11, 375-384 (2010).


19. McCandless, M., Hatcher, E. & Gospodnetic, O. Lucene in Action: Covers Apache Lucene 3.0. (Manning Publications Co., 2010).

20. Shultz, M. Comparing test searches in PubMed and Google Scholar. J Med Libr Assoc 95, 442-445 (2007).

21. McEntyre, J. & Lipman, D. PubMed: bridging the information gap. CMAJ 164, 1317-1319 (2001).

22. Robertson, S. Understanding inverse document frequency: on theoretical arguments for IDF. Journal of documentation 60, 503-520 (2004).

23. Berry, M.W., Drmac, Z. & Jessup, E.R. Matrices, vector spaces, and information retrieval. SIAM review 41, 335-362 (1999).

24. Singhal, A. Modern information retrieval: A brief overview. IEEE Data Eng. Bull. 24, 35-43 (2001).

25. Strehl, A., Ghosh, J. & Mooney, R. in Workshop on Artificial Intelligence for Web Search (AAAI 2000) 58-64 (2000).

26. Wang, R., Perez-Riverol, Y., Hermjakob, H. & Vizcaino, J.A. Open source libraries and frameworks for biological data visualisation: A guide for developers. Proteomics (2014).

27. Trapp, J., Armengaud, J., Salvador, A., Chaumot, A. & Geffard, O. Next-generation proteomics: toward customized biomarkers for environmental biomonitoring. Environ Sci Technol 48, 13560-13572 (2014).
