Building a Network of Interoperable and Independently Produced Linked and Open Biomedical Data


Michel Dumontier, Ph.D.

Associate Professor of Medicine (Biomedical Informatics), Stanford University

An invited talk in support of the 2016 Herman Skolnik Awardees

@micheldumontier::ACS:23-08-16


My research aims to develop computational methods for biomedical knowledge discovery

We develop tools and methods to represent, store, publish, integrate, query, and reuse biomedical data, software, and ontologies

Reuse needs to be considered firmly in the context of discovery and reproducibility

Most published research findings are false - John Ioannidis, Stanford University

Reproducible discovery

1. Data Science Tools and Methods
– Infrastructure: To identify, annotate, link, integrate, search for and query data and services
– Tools: To identify and uncover support for known or novel associations

2. Community Standards to contribute to and interrogate a massive, decentralized network of interconnected data and software

FAIR: Findable, Accessible, Interoperable, Re-usable

Findable
– Globally unique identifiers for datasets and the data they contain
– Rich set of descriptors to search and filter with
– Indexed and searchable

Accessible
– Identifiers can be used to retrieve representations using standard protocols (e.g. HTTP)
– Metadata is always available

Interoperable
– Data represented with formal knowledge representations
– Include links to other datasets/vocabularies

Reusable
– Licensing, Provenance, Community standards
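The four FAIR facets above can be read as a checklist over a dataset's metadata. The following is a minimal sketch of that reading, not an official FAIR assessment: the `DatasetDescription` record, its field names, and the pass/fail rules are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetDescription:
    """Hypothetical metadata record; the field names are illustrative only."""
    identifier: str = ""                              # globally unique identifier, e.g. a URI
    descriptors: dict = field(default_factory=dict)   # title, keywords, ...
    access_url: str = ""                              # where a representation can be retrieved
    rdf_format: str = ""                              # formal representation, e.g. "text/turtle"
    links_to: list = field(default_factory=list)      # other datasets or vocabularies
    license: str = ""
    provenance: str = ""


def fair_gaps(d: DatasetDescription) -> list:
    """Return the FAIR facets for which the record carries no information."""
    checks = {
        "findable": bool(d.identifier) and bool(d.descriptors),
        "accessible": d.access_url.startswith(("http://", "https://")),
        "interoperable": bool(d.rdf_format) and bool(d.links_to),
        "reusable": bool(d.license) and bool(d.provenance),
    }
    return [facet for facet, ok in checks.items() if not ok]


# A record with an identifier and an HTTP location, but no format, links,
# license, or provenance (example.org values are placeholders).
record = DatasetDescription(
    identifier="http://example.org/dataset/1",
    descriptors={"title": "Example dataset"},
    access_url="http://example.org/dataset/1.ttl",
)
print(fair_gaps(record))
```

A check like this only tests whether the metadata is present; it says nothing about whether the identifiers resolve or the license is appropriate.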

The Semantic Web is the new global web of knowledge

standards for publishing, sharing and querying facts, expert knowledge and services

scalable approach for the discovery of independently formulated and distributed knowledge

Linked Data offers a solid foundation for FAIR data

• Entities (people, proteins, pathways, etc) are identified using globally unique identifiers (URIs)

• Entity descriptions are represented with a standardized language (RDF)

• Data can be retrieved using a universal protocol (HTTP)

• Entities (concepts, data, resources) can be linked together to increase interoperability
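The four points above can be made concrete with a handful of triples. This is a minimal sketch using only the Python standard library; the `example.org` URIs and the `participatesIn` property are placeholders invented for illustration, and the serializer covers only IRIs and plain string literals rather than the full N-Triples syntax.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


def iri(value: str) -> str:
    """Wrap a URI in angle brackets, as N-Triples expects."""
    return f"<{value}>"


def literal(text: str) -> str:
    """Quote a plain string literal, escaping backslashes and double quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_ntriples(triples) -> str:
    """Serialize (subject, predicate, object) terms, one statement per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


# Placeholder entities: a protein linked to a pathway.
protein = "http://example.org/protein/P1"
pathway = "http://example.org/pathway/W1"

triples = [
    (iri(protein), iri(RDF_TYPE), iri("http://example.org/vocab/Protein")),
    (iri(protein), iri(RDFS_LABEL), literal("example protein")),
    (iri(protein), iri("http://example.org/vocab/participatesIn"), iri(pathway)),
]

print(to_ntriples(triples))
```

The third statement is the link: because the pathway is itself named by a URI, another dataset can describe it independently and the two descriptions join on that identifier.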

Linked Data for the Life Sciences

Bio2RDF is an open source project to unify the representation and interlinking of biological data using RDF.

chemicals/drugs/formulations, genomes/genes/proteins, domains
Interactions, complexes & pathways
animal models and phenotypes
Disease, genetic markers, treatments
Terminologies & publications

• 11B+ interlinked statements from 35 biomedical datasets and 400+ ontologies

• dataset description, provenance & statistics

• A growing interoperable ecosystem with the EBI, NCBI, DBCLS, NCBO, OpenPHACTS, and commercial tool providers

Bio2RDF normalizes identifiers, formats, links, and access
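Identifier normalization can be sketched as a small function that maps a prefixed identifier onto a single URI pattern. The pattern `http://bio2rdf.org/<namespace>:<identifier>` is inferred here from the predicate URI in the query shown in this talk; the lowercasing rule and the function names are assumptions, and the actual Bio2RDF conversion scripts may apply further rules.

```python
def bio2rdf_uri(namespace: str, identifier: str) -> str:
    """Build a URI of the form http://bio2rdf.org/<namespace>:<identifier>.

    Assumption: namespaces are lowercased and identifiers are kept as given.
    """
    ns = namespace.strip().lower()
    ident = identifier.strip()
    if not ns or not ident:
        raise ValueError("both a namespace and an identifier are required")
    return f"http://bio2rdf.org/{ns}:{ident}"


def normalize_curie(curie: str) -> str:
    """Normalize a 'Prefix:Identifier' string from a source dataset."""
    namespace, separator, identifier = curie.partition(":")
    if not separator:
        raise ValueError(f"no namespace prefix in {curie!r}")
    return bio2rdf_uri(namespace, identifier)


# Differently cased or spaced prefixes collapse onto the same URI.
print(normalize_curie("DrugBank:DB00001"))
print(normalize_curie("drugbank: DB00001"))
```

Once every dataset emits the same URI for the same entity, links between datasets become plain string equality.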

Bio2RDF shows how datasets are connected together

Queries can be federated across private and public SPARQL databases

Get all protein catabolic processes (and more specific GO terms) in biomodels

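PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# Count the biochemical reactions in biomodels annotated with each matching GO term
SELECT ?go ?label (COUNT(DISTINCT ?x) AS ?count)
WHERE {
  # GO terms that are subclasses of a term labelled "protein catabolic process..."
  SERVICE <http://bioportal.bio2rdf.org/sparql> {
    ?go rdfs:label ?label .
    ?go rdfs:subClassOf+ ?tgo .
    ?tgo rdfs:label ?tlabel .
    FILTER regex(?tlabel, "^protein catabolic process")
  }
  # Reactions in biomodels that are linked to those GO terms
  SERVICE <http://biomodels.bio2rdf.org/sparql> {
    ?x <http://bio2rdf.org/biopax_vocabulary:identical-to> ?go .
    ?x a <http://www.biopax.org/release/biopax-level3.owl#BiochemicalReaction> .
  }
}
GROUP BY ?go ?label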
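A query like this is sent to an endpoint over HTTP using the SPARQL protocol, which accepts the query text as a `query` parameter. The sketch below only builds the request and does not send it, since the endpoints named in the talk may no longer be online; the short query and the helper name `build_sparql_request` are illustrative assumptions.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Endpoint taken from the talk; its current availability is not assumed.
ENDPOINT = "http://bioportal.bio2rdf.org/sparql"

# A deliberately small query used only to illustrate the request shape.
QUERY = """PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?go ?label WHERE { ?go rdfs:label ?label } LIMIT 5"""


def build_sparql_request(endpoint: str, query: str) -> Request:
    """Build an HTTP GET request carrying the query in the 'query' parameter."""
    url = f"{endpoint}?{urlencode({'query': query})}"
    # Ask for the standard JSON results format.
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_sparql_request(ENDPOINT, QUERY)
print(request.full_url)
```

Passing the request to `urllib.request.urlopen` would execute it. In a federated query, the endpoint that receives the request forwards each `SERVICE` block to the endpoint named in it.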

Graph-like representation amenable to finding mismatches and discovering new links

W Hu, H Qiu, M Dumontier. Link Analysis of Life Science Linked Data. International Semantic Web Conference (2) 2015: 446-462.

EbolaKB Using Linked Data and Software

Kamdar, Dumontier. An Ebola virus-centered knowledge base. Database. 2015 Jun 8;2015. doi: 10.1093/database/bav049.

Network analysis and discovery

McCusker, McGuinness, Dumontier. In prep.

Can we implement an open version of PREDICT using Linked Data?

AUC 0.91 across all therapeutic indications

Drug-drug similarity
A. Chemical structure similarity
B. Side effect similarity
C. Target sequence similarity
D. Target functional similarity
E. Network distance

Disease-disease similarity
A. Phenotype based
B. Text extracted concepts
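The similarity measures listed here can be combined to score a candidate drug-disease pair against known pairs. The sketch below is an illustration under stated assumptions and not the PREDICT implementation: it uses Tanimoto similarity over toy feature sets for both drugs and diseases, and it assumes a candidate is scored by its most similar known pair using the geometric mean of the two similarities. All names and data are invented.

```python
from math import sqrt


def tanimoto(a: set, b: set) -> float:
    """Tanimoto (Jaccard) similarity between two feature sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def association_score(drug, disease, known_pairs, drug_sim, disease_sim) -> float:
    """Score a candidate pair by its closest known (drug, disease) pair.

    Assumed combination rule: geometric mean of the drug-drug and
    disease-disease similarities, maximized over the known pairs.
    """
    return max(
        (sqrt(drug_sim(drug, d) * disease_sim(disease, s)) for d, s in known_pairs),
        default=0.0,
    )


# Toy data: substructure keys for drugs, phenotype terms for diseases.
fingerprints = {"drugA": {1, 2, 3, 4}, "drugB": {1, 2, 3, 5}, "drugC": {7, 8}}
phenotypes = {"dis1": {"p1", "p2"}, "dis2": {"p1", "p2", "p3"}, "dis3": {"p9"}}
known = [("drugB", "dis2"), ("drugC", "dis3")]

score = association_score(
    "drugA",
    "dis1",
    known,
    drug_sim=lambda x, y: tanimoto(fingerprints[x], fingerprints[y]),
    disease_sim=lambda x, y: tanimoto(phenotypes[x], phenotypes[y]),
)
print(round(score, 3))
```

Each pairing of one drug-drug measure with one disease-disease measure would yield one such score, and the scores could then serve as features for a classifier.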

HyQue: Hypothesis Validation

• A platform for knowledge discovery that uses data retrieval coupled with automated reasoning to validate scientific hypotheses

• Leverages semantic technologies to provide access to linked data, ontologies, and semantic web services

• Uses positive and negative findings, captures provenance

• Weighs evidence according to context

• Used to find aging genes in worm, assess cardiotoxicity of tyrosine kinase inhibitors

HyQue: evaluating hypotheses using Semantic Web technologies. J Biomed Semantics. 2011 May 17;2 Suppl 2:S3.

Evaluating scientific hypotheses using the SPARQL Inferencing Notation. Extended Semantic Web Conference (ESWC 2012). Heraklion, Crete. May 27-31, 2012.
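The idea of weighing positive and negative findings while keeping provenance can be sketched in a few lines. This is not HyQue's scoring method, which the papers above describe in terms of Semantic Web rules; the `Evidence` record, the weights, and the normalization below are assumptions chosen only to illustrate the principle.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    """Hypothetical evidence record; field names are illustrative only."""
    statement: str
    supports: bool   # True for a positive finding, False for a negative one
    weight: float    # context-dependent weight assigned by the evaluator
    source: str      # provenance: where the finding came from


def evaluate(evidence):
    """Return a score in [-1, 1] and the sorted list of contributing sources."""
    total = sum(e.weight for e in evidence)
    if total == 0:
        return 0.0, []
    signed = sum(e.weight if e.supports else -e.weight for e in evidence)
    provenance = sorted({e.source for e in evidence})
    return signed / total, provenance


# Toy findings for a hypothesis that a drug is cardiotoxic.
findings = [
    Evidence("adverse event reported", True, 2.0, "adverse event reports"),
    Evidence("target expressed in heart", True, 1.0, "expression data"),
    Evidence("no effect in animal model", False, 1.0, "model organism data"),
]

score, sources = evaluate(findings)
print(score, sources)
```

Negative findings lower the score instead of being discarded, and the list of sources lets a reader trace which datasets contributed to the outcome.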

What evidence might we gather?

• clinical: Are there cardiotoxic effects associated with the drug?
– Literature (studies) [curated db]
– Product labels (studies) [r3:sider]
– Clinical trials (studies) [r3:clinicaltrials]
– Adverse event reports [r2:pharmgkb/onesides]
– Electronic health records (observations)

• pre-clinical associations:
– genotype-phenotype (null/disease models) [r2:mgi, r2:sgd; r3:wormbase]
– in vitro assays (IC50) [r3:chembl]
– drug targets [r2:drugbank; r2:ctd; r3:stitch]
– drug-gene expression [r3:gxa]
– pathways [r2:kegg; r3:reactome]
– Drug-pathway, disease-pathway enrichments [aberrant pathways]
– Chemical properties [r2:pubchem; r2:drugbank]
– Toxicology [r1:toxkb/cebs]

Beyond Bio2RDF

Network of Linked Data (~2007)

Expansion across domains

“Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/”

A rapidly growing network of Linked Data

“Linking Open Data cloud diagram 2014, by Max Schmachtenberg, Christian Bizer, Anja Jentzsch and Richard Cyganiak. http://lod-cloud.net/”

but the lack of coordination makes Linked Open Data chaotic and unwieldy

There is no shortage of vocabularies, ontologies and community-based standards


metadatacenter.org

NIH COMMONS

Making it Easier, Possibly Even Pleasant, to Author Interoperable Experimental Metadata

PubChem engaged the community to reuse and extend existing vocabularies


Semanticscience Integrated Ontology (SIO)
An effective upper level ontology.
1500+ classes
207 object properties (inc. inverses)
1 datatype property

Chemical Information Ontology (CHEMINF)

• Collaborative ontology

• Distinguishes algorithmic, or procedural information from declarative, or factual information, and renders of particular importance the annotation of provenance to calculated data.

Where are we going?

• Large-scale publishing across biomedical datatypes is possible on the web

• Hubs such as NCBI and EBI now integrate data, but there is a need for global coordination on all datatypes

• Standard vocabularies must be open, freely accessible, and demonstrably reused

• Use of worldwide data integration formats (RDF) and improved linking of data

• Easier to deploy toolkits for providing standards-compliant linked data

Linked Data Platform

Docker

• Data conversion scripts

• Query Editor

• Faceted Browser

• Relation Exploration

• API

• Data and data store

Model Organism Linked Data

MO-LD.org

In Summary

• We use semantic technologies such as ontologies and linked data to make sense of and facilitate access to biomedical data (FAIR)

• The intimate development and use of standards by PubChem and others brings us closer to an interoperability ideal

• Much more work is needed to support (computational) discovery in a reproducible manner.

Acknowledgements

Dumontier Lab
• Amrapali Zaveri
• Mary Panahiazar
• Shima Dastgheib
• Sandeep Ayyar
• Remzi Celebi
• David Odgers
• Wei Hu
• Ruben Verborgh
• Leo Chepelev
• Alison Callahan
• Jose Miguel Toledo Cruz
• Tanya Hiebert
• Beatriz Lujan
+ many more

Collaborators
• Mark Musen
• Nigam Shah
• Robert Hoehndorf
• Janna Hastings
• Christoph Steinbeck
• Egon Willighagen
• Nico Adams
• Colin Batchelor
• David Wild
• Evan Bolton
• Gang Fu
+ many more

Email: michel.dumontier@stanford.edu
Website: http://dumontierlab.com
Presentations: http://slideshare.com/micheldumontier