Building a Network of Interoperable and Independently Produced Linked and Open Biomedical Data

Post on 15-Apr-2017


Building a network of interoperable and independently produced linked and open biomedical data

Michel Dumontier, Ph.D.

Associate Professor of Medicine (Biomedical Informatics), Stanford University

@micheldumontier::ACS:23-08-16

An invited talk in support of the 2016 Herman Skolnik Awardees


My research aims to develop computational methods for biomedical knowledge discovery

We develop tools and methods to represent, store, publish, integrate, query, and reuse biomedical data, software, and ontologies


Reuse needs to be considered firmly in the context of discovery and reproducibility


Most published research findings are false

- John Ioannidis, Stanford University


Reproducible discovery

1. Data Science Tools and Methods

– Infrastructure: To identify, annotate, link, integrate, search for and query data and services

– Tools: To identify and uncover support for known or novel associations

2. Community Standards to contribute to and interrogate a massive, decentralized network of interconnected data and software


FAIR: Findable, Accessible, Interoperable, Re-usable


Findable

– Globally unique identifiers for datasets and the data they contain

– Rich set of descriptors to search and filter with

– Indexed and searchable

Accessible

– Identifiers can be used to retrieve representations using standard protocols (e.g. HTTP)

– Metadata is always available

Interoperable

– Data represented with formal knowledge representations

– Include links to other datasets/vocabularies

Reusable

– Licensing, Provenance, Community standards


The Semantic Web is the new global web of knowledge

standards for publishing, sharing and querying facts, expert knowledge and services

scalable approach for the discovery of independently formulated and distributed knowledge


Linked Data offers a solid foundation for FAIR data

• Entities (people, proteins, pathways, etc) are identified using globally unique identifiers (URIs)

• Entity descriptions are represented with a standardized language (RDF)

• Data can be retrieved using a universal protocol (HTTP)

• Entities (concepts, data, resources) can be linked together to increase interoperability
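The four points above can be sketched in a few lines of Python using only the standard library. This is an illustration under stated assumptions, not code from the talk: the `Triple` type, the `to_ntriples` helper, the example entity URIs and the `http://example.org/vocab#target` predicate are all hypothetical, and the literal escaping is deliberately simplified.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF statement: subject and predicate are URIs; the object is a URI or a literal."""
    subject: str
    predicate: str
    obj: str
    obj_is_uri: bool = True


def to_ntriples(triple: Triple) -> str:
    """Serialize one statement as an N-Triples line (simplified escaping)."""
    if triple.obj_is_uri:
        obj = f"<{triple.obj}>"
    else:
        # Escape backslashes and double quotes only; a full serializer
        # would also handle newlines and other control characters.
        escaped = triple.obj.replace("\\", "\\\\").replace('"', '\\"')
        obj = f'"{escaped}"'
    return f"<{triple.subject}> <{triple.predicate}> {obj} ."


# Illustrative (assumed) entity URIs: a drug and one of its protein targets.
DRUG = "http://bio2rdf.org/drugbank:DB00001"
TARGET = "http://bio2rdf.org/uniprot:P00734"

statements = [
    # A literal label attached to the drug entity.
    Triple(DRUG, "http://www.w3.org/2000/01/rdf-schema#label", "Lepirudin", obj_is_uri=False),
    # A link from one entity to another; this link is what makes the data "linked".
    Triple(DRUG, "http://example.org/vocab#target", TARGET),
]

for statement in statements:
    print(to_ntriples(statement))
```

Because every subject, predicate and linked object is a URI, any of them can in principle be dereferenced over HTTP to obtain further statements.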


Linked Data for the Life Sciences


Bio2RDF is an open source project to unify the representation and interlinking of biological data using RDF.

chemicals/drugs/formulations

genomes/genes/proteins, domains

interactions, complexes & pathways

animal models and phenotypes

disease, genetic markers, treatments

terminologies & publications

• 11B+ interlinked statements from 35 biomedical datasets and 400+ ontologies

• dataset description, provenance & statistics

• A growing interoperable ecosystem with the EBI, NCBI, DBCLS, NCBO, OpenPHACTS, and commercial tool providers


Bio2RDF normalizes identifiers, formats, links, and access
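Identifier normalization could look roughly like the sketch below, which maps a (namespace, identifier) pair onto the `http://bio2rdf.org/namespace:identifier` pattern visible in the query later in this talk. The synonym table and the `normalize_identifier` function are hypothetical and far smaller than any real registry of namespaces.

```python
import re

BIO2RDF_BASE = "http://bio2rdf.org/"

# Hypothetical synonym table: different sources often name the same
# namespace differently, so each variant maps to one preferred prefix.
NAMESPACE_SYNONYMS = {
    "entrez gene": "ncbigene",
    "geneid": "ncbigene",
    "uniprotkb": "uniprot",
}


def normalize_identifier(namespace: str, local_id: str) -> str:
    """Return a URI of the form http://bio2rdf.org/<namespace>:<identifier>."""
    ns = namespace.strip().lower()
    ns = NAMESPACE_SYNONYMS.get(ns, ns)
    # Keep only characters that are safe in a namespace prefix.
    ns = re.sub(r"[^a-z0-9_.]+", "", ns)
    local = local_id.strip()
    if not ns or not local:
        raise ValueError("namespace and identifier must both be non-empty")
    return f"{BIO2RDF_BASE}{ns}:{local}"


print(normalize_identifier("UniProtKB", " P00734 "))
print(normalize_identifier("Entrez Gene", "2147"))
```

Once every dataset emits identifiers through one such function, two records that mention the same entity end up with the same URI and can be joined without further mapping.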


Bio2RDF shows how datasets are connected together


Queries can be federated across private and public SPARQL databases

Get all protein catabolic processes (and more specific GO terms) in biomodels

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?go ?label (COUNT(DISTINCT ?x) AS ?count)
WHERE {
  SERVICE <http://bioportal.bio2rdf.org/sparql> {
    ?go rdfs:label ?label .
    ?go rdfs:subClassOf+ ?tgo .
    ?tgo rdfs:label ?tlabel .
    FILTER regex(?tlabel, "^protein catabolic process")
  }
  SERVICE <http://biomodels.bio2rdf.org/sparql> {
    ?x <http://bio2rdf.org/biopax_vocabulary:identical-to> ?go .
    ?x a <http://www.biopax.org/release/biopax-level3.owl#BiochemicalReaction> .
  }
}
GROUP BY ?go ?label
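A query like this is sent to an endpoint as an ordinary HTTP request. The sketch below builds such a request with the Python standard library, following the SPARQL Protocol convention of passing the query text in a `query` parameter. The `build_sparql_request` helper and the shortened example query are assumptions for illustration, and the endpoint named on the slide may no longer be live, so the network call is left commented out.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_sparql_request(endpoint: str, query: str) -> Request:
    """Build an HTTP GET request that asks a SPARQL endpoint for JSON results."""
    params = urlencode({"query": query})
    return Request(
        f"{endpoint}?{params}",
        headers={"Accept": "application/sparql-results+json"},
    )


# A shortened, single-endpoint variant of the query above (illustrative only).
QUERY = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?go ?label WHERE {
  ?go rdfs:label ?label .
  FILTER regex(?label, "^protein catabolic process")
} LIMIT 10
"""

request = build_sparql_request("http://bioportal.bio2rdf.org/sparql", QUERY)
print(request.full_url)

# To execute (requires network access and a live endpoint):
#   import json
#   from urllib.request import urlopen
#   with urlopen(request, timeout=30) as response:
#       rows = json.load(response)["results"]["bindings"]
```

For a federated query, the client still talks to a single endpoint; that endpoint forwards each `SERVICE` block to the remote database named in it.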


Graph-like representation amenable to finding mismatches and discovering new links

W Hu, H Qiu, M Dumontier. Link Analysis of Life Science Linked Data. International Semantic Web Conference (2) 2015: 446-462.


EbolaKB Using Linked Data and Software

Kamdar, Dumontier. An Ebola virus-centered knowledge base. Database. 2015 Jun 8;2015. doi: 10.1093/database/bav049.


Network analysis and discovery

McCusker, McGuinness, Dumontier. In prep.


Can we implement an open version of PREDICT using Linked Data?

AUC 0.91 across all therapeutic indications

Drug-drug similarity

A. Chemical structure similarity

B. Side effect similarity

C. Target sequence similarity

D. Target functional similarity

E. Network distance

Disease-disease similarity

A. Phenotype based

B. Text extracted concepts
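To make the idea of combining drug-drug and disease-disease similarity concrete, here is a minimal sketch. It scores a candidate drug-disease pair by its most similar known association, taking the geometric mean of the two similarities. Both the Tanimoto measure on feature sets and this combination rule are assumptions chosen for illustration, not necessarily what the open implementation uses, and all data below are made up.

```python
from math import sqrt


def tanimoto(a: set, b: set) -> float:
    """Tanimoto (Jaccard) similarity between two feature sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def association_score(drug, disease, known, drug_sim, disease_sim) -> float:
    """Score a candidate (drug, disease) pair against known associations.

    For each known pair (d, s) other than the candidate itself, combine the
    drug-drug and disease-disease similarities by their geometric mean and
    keep the best match.
    """
    return max(
        (
            sqrt(drug_sim(drug, d) * disease_sim(disease, s))
            for d, s in known
            if (d, s) != (drug, disease)
        ),
        default=0.0,
    )


# Toy feature sets (made up): structural fingerprint bits and phenotype terms.
drug_features = {"drugA": {1, 2, 3, 4}, "drugB": {2, 3, 4, 5}, "drugC": {9}}
disease_features = {"dis1": {"p1", "p2"}, "dis2": {"p1", "p2", "p3", "p4"}}
known_associations = [("drugB", "dis2"), ("drugC", "dis1")]

score = association_score(
    "drugA",
    "dis1",
    known_associations,
    drug_sim=lambda x, y: tanimoto(drug_features[x], drug_features[y]),
    disease_sim=lambda x, y: tanimoto(disease_features[x], disease_features[y]),
)
print(round(score, 3))
```

In practice one such score would be computed for each pairing of a drug similarity measure (A to E) with a disease similarity measure (A, B), and the resulting scores used as features for a classifier.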


HyQue: Hypothesis Validation

• A platform for knowledge discovery that uses data retrieval coupled with automated reasoning to validate scientific hypotheses

• Leverages semantic technologies to provide access to linked data, ontologies, and semantic web services

• Uses positive and negative findings, captures provenance

• Weighs evidence according to context

• Used to find aging genes in worm, assess cardiotoxicity of tyrosine kinase inhibitors

HyQue: evaluating hypotheses using Semantic Web technologies. J Biomed Semantics. 2011 May 17;2 Suppl 2:S3.

Evaluating scientific hypotheses using the SPARQL Inferencing Notation. Extended Semantic Web Conference (ESWC 2012). Heraklion, Crete. May 27-31, 2012.


What evidence might we gather?

• clinical: Are there cardiotoxic effects associated with the drug?

– Literature (studies) [curated db]

– Product labels (studies) [r3:sider]

– Clinical trials (studies) [r3:clinicaltrials]

– Adverse event reports [r2:pharmgkb/onesides]

– Electronic health records (observations)

• pre-clinical associations:

– genotype-phenotype (null/disease models) [r2:mgi, r2:sgd; r3:wormbase]

– in vitro assays (IC50) [r3:chembl]

– drug targets [r2:drugbank; r2:ctd; r3:stitch]

– drug-gene expression [r3:gxa]

– pathways [r2:kegg; r3:reactome]

– Drug-pathway, disease-pathway enrichments [aberrant pathways]

– Chemical properties [r2:pubchem; r2.drugbank]

– Toxicology [r1.toxkb/cebs]
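One simple way to picture how positive and negative findings from these sources might be weighed is sketched below. The `Evidence` record, the weights and the normalized scoring rule are hypothetical and only illustrate the idea; HyQue's own rules are described in the papers cited above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    source: str      # dataset the finding came from, kept for provenance
    supports: bool   # True for a positive finding, False for a negative one
    weight: float    # context-dependent importance (hypothetical values)


def hypothesis_score(evidence: list) -> float:
    """Return a score in [-1, 1]: weighted support minus weighted refutation,
    normalized by the total weight of all evidence."""
    total = sum(e.weight for e in evidence)
    if total == 0:
        return 0.0
    signed = sum(e.weight if e.supports else -e.weight for e in evidence)
    return signed / total


# Made-up findings for the hypothesis "drug X is cardiotoxic".
findings = [
    Evidence("r3:sider", supports=True, weight=2.0),
    Evidence("r3:chembl", supports=True, weight=1.0),
    Evidence("r3:clinicaltrials", supports=False, weight=1.0),
]

print(hypothesis_score(findings))
```

Keeping the source on every record means the final score can always be traced back to the datasets that produced it.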


HyQue


Beyond Bio2RDF


Network of Linked Data (~2007)


Expansion across domains

“Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/”


A rapidly growing network of Linked Data

“Linking Open Data cloud diagram 2014, by Max Schmachtenberg, Christian Bizer, Anja Jentzsch and Richard Cyganiak. http://lod-cloud.net/”


but the lack of coordination makes Linked Open Data chaotic and unwieldy


There is no shortage of vocabularies, ontologies and community-based standards




metadatacenter.org

NIH COMMONS

Making it Easier, Possibly Even Pleasant, to Author Interoperable Experimental Metadata


PubChem engaged the community to reuse and extend existing vocabularies


Semanticscience Integrated Ontology (SIO)

An effective upper level ontology.

• 1500+ classes

• 207 object properties (inc. inverses)

• 1 datatype property


Chemical Information Ontology (CHEMINF)

• Collaborative ontology

• Distinguishes algorithmic, or procedural, information from declarative, or factual, information, and renders of particular importance the annotation of provenance to calculated data.


Where are we going?

• Large-scale publishing across biomedical datatypes is possible on the web

• Hubs such as NCBI and EBI now integrate data, but there is a need for global coordination on all datatypes

• Standard vocabularies must be open, freely accessible, and demonstrably reused

• Use of worldwide data integration formats (RDF) and improved linking of data

• Easier to deploy toolkits for providing standards-compliant linked data


Linked Data Platform

Docker

• Data conversion scripts

• Query Editor

• Faceted Browser

• Relation Exploration

• API

• Data and data store

Model Organism Linked Data

MO-LD.org


In Summary

• We use semantic technologies such as ontologies and linked data to make sense of and facilitate access to biomedical data (FAIR)

• The intimate development and use of standards by PubChem and others brings us closer to an interoperability ideal

• Much more work is needed to support (computational) discovery in a reproducible manner.


Acknowledgements

Dumontier Lab

• Amrapali Zaveri
• Mary Panahiazar
• Shima Dastgheib
• Sandeep Ayyar
• Remzi Celebi
• David Odgers
• Wei Hu
• Ruben Verborgh
• Leo Chepelev
• Alison Callahan
• Jose Miguel Toledo Cruz
• Tanya Hiebert
• Beatriz Lujan
+ many more

Collaborators

• Mark Musen
• Nigam Shah
• Robert Hoehndorf
• Janna Hastings
• Christoph Steinbeck
• Egon Willighagen
• Nico Adams
• Colin Batchelor
• David Wild
• Evan Bolton
• Gang Fu
+ many more


dumontierlab.com

michel.dumontier@stanford.edu

Website: http://dumontierlab.com

Presentations: http://slideshare.com/micheldumontier