European Life Sciences Infrastructure for Biological Information

www.elixir-europe.org

The ELIXIR of Linked Data

Professor Carole Goble (UK node)

Barend Mons (NL node), Helen Parkinson (EMBL-EBI node)

The Interoperability Services Backbone Team

What is ELIXIR?

An international distributed infrastructure for life-science information

orchestrate the collection, quality control and archiving of biological data produced by life science experiments.

integrate research data

ensure a seamless service provision that is easily accessible to all.

http://www.elixir-europe.org/about

ELIXIR: An international distributed infrastructure for biological data

Hub

major bioinformatics service providers (~130)

16 ELIXIR members

4 observers

Drivers: Infrastructure Providers

COordinated Research Infrastructures Building Enduring Life-science Services

Marine metagenomics

Human data

Crop and forest plants

Rare diseases

Rare diseases

Genomic data (WES, WGS)

Other omics data (transcriptomics, metabolomics, proteomics …)

Sample data (biobank databases)

Clinical data (registries and phenotypic databases)

1000 exomes + > 2500 from other projects

Drug prioritization for Huntington’s Disease

Katerina Nosikova, Elizaveta Besedina, Eelke van der Horst, Peter-Bram ‘t Hoen, Marco Roos, Eleni Mina, Human Genetics department, LUMC, NL

Select genes by phenotype matching in Monarch

Select drug compounds in Open PHACTS

Filter on feasibility for treating HD

Prioritized drug compounds
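
The four steps above can be read as a simple filter pipeline. The following is a minimal sketch under stated assumptions: it uses toy in-memory data rather than calling Monarch or Open PHACTS, and every name in it (the gene and compound records, the score threshold, the blood-brain-barrier feasibility check, the function names) is hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Compound:
    name: str
    target_gene: str
    crosses_blood_brain_barrier: bool  # hypothetical feasibility criterion


# Toy stand-ins for results that would come from Monarch and Open PHACTS.
PHENOTYPE_SCORES = {"GENE_A": 0.9, "GENE_B": 0.4, "GENE_C": 0.75}
COMPOUNDS = [
    Compound("compound-1", "GENE_A", True),
    Compound("compound-2", "GENE_A", False),
    Compound("compound-3", "GENE_C", True),
    Compound("compound-4", "GENE_B", True),
]


def select_genes(scores, threshold):
    """Step 1: keep genes whose phenotype-match score reaches the threshold."""
    return {gene for gene, score in scores.items() if score >= threshold}


def select_compounds(compounds, genes):
    """Step 2: keep compounds that target one of the selected genes."""
    return [c for c in compounds if c.target_gene in genes]


def filter_feasible(compounds):
    """Step 3: keep compounds that pass the (toy) feasibility check."""
    return [c for c in compounds if c.crosses_blood_brain_barrier]


def prioritize(compounds, scores):
    """Step 4: rank by the phenotype-match score of the target gene."""
    return sorted(compounds, key=lambda c: scores[c.target_gene], reverse=True)


genes = select_genes(PHENOTYPE_SCORES, threshold=0.7)
ranked = prioritize(
    filter_feasible(select_compounds(COMPOUNDS, genes)), PHENOTYPE_SCORES
)
print([c.name for c in ranked])  # ['compound-1', 'compound-3']
```

Keeping each step as its own function mirrors the slide: each stage could be swapped for a call to a real service without touching the others.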

Technical platforms

Data: Secure and deliver core data resources

Tools: Discoverable tools, services and connectors for data access and exploitation

Compute: Robust technical platforms and clouds for secure data access, data exchange and compute

Training: Training programme for professionals, bridging the computational biology skills gap

Standards: Data management, reuse and integration

Findable Accessible Interoperable Reusable

Training: BYODs, data wrangling, governance and quality assurance

Linked Data experts, data experts from MycoBase and Human Protein Atlas

http://www.macs.hw.ac.uk/~ajg33/first-byod-workshop/

Tomato genome, phenotypic observations, variants

http://www.mycobank.org/

Impact

Indicators: Scientific focus, Scientific impact, Community, Legal & funding infrastructure, Quality

Data: Basket of indicators, reflecting the multiple facets of bioinformatics resources

1) Scientific focus and quality of science, e.g. curational effort, benchmarking

2) Community served by the resource, e.g. web statistics

3) Quality of service, e.g. uptime, user support and training

4) Legal and funding infrastructure, e.g. institutional support, use policy

5) Impact and translational stories

Mandatory and optional

Compute Platform: Authentication, Archiving and Movement

Tools Interoperability and APIs

Describing Tools

EDAM Ontology

Describing Workflows

Common format for bioinformatics tool execution

http://commonwl.org/

Rich: Linked Data allows for infinite metadata annotations and reasoning

SWAGGER.json

Describing APIs

API changes: Semantic versioning

Getting resources to have APIs
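
To illustrate how semantic versioning communicates API changes, here is a small sketch that compares two MAJOR.MINOR.PATCH version strings and reports what a client should expect. The function names are hypothetical, and the sketch deliberately ignores pre-release tags, build metadata, and the special status of 0.x versions.

```python
from typing import NamedTuple


class Version(NamedTuple):
    major: int
    minor: int
    patch: int


def parse_version(text: str) -> Version:
    """Parse a plain MAJOR.MINOR.PATCH string (no pre-release or build tags)."""
    major, minor, patch = (int(part) for part in text.split("."))
    return Version(major, minor, patch)


def classify_change(old: str, new: str) -> str:
    """Say what a client should expect when an API moves from old to new."""
    a, b = parse_version(old), parse_version(new)
    if b < a:
        return "downgrade"
    if b == a:
        return "unchanged"
    if b.major != a.major:
        return "breaking"
    if b.minor != a.minor:
        return "backward-compatible feature"
    return "backward-compatible fix"


print(classify_change("1.4.2", "2.0.0"))  # breaking
print(classify_change("1.4.2", "1.5.0"))  # backward-compatible feature
```

A client that pins the major version can then accept any "backward-compatible" result automatically and flag only "breaking" changes for review.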

http://commonwl.org/

[Luiz Olavo Bonino, DTL] RD-CONNECT, ODEXA4ALL

A FAIRifying Architecture

Warehouses

Preparing Sources

On boarding

Datasets, Content, API

Access from Integrating Frameworks

Interoperability Services: Identifiers, Ontologies, Schemas.

API

FAIR Interoperability Backbone Services

Prepare for interop

• Various species: maize, pine, potato…

• Various data types: from genomes (sequences and annotations) to phenomes (traits)

• Various ontologies: Crop Ontology, Plant Ontology…

• Emerging standards: MIAPPE (Minimum Information About a Plant Phenotyping Experiment)

Need for infrastructure

o Manage identifiers

o Register/access services and data sets

o Metadata driven search

© Paul Kersey

Crop and forest plants

Ontology Services

Ontology mapping

Data-Ontology Tools

OLS3

Identifiers – the pivot of everything!

Identifier Mapping Service (IMS)

Identifier Resolution Service (IRS2)
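
The division of labour between the two services can be sketched as follows: resolution turns an identifier into a location, while mapping finds equivalent identifiers in other datasets. This is an illustrative toy, not the IMS or IRS2 interface. The prefixes, URL templates, mapping pairs and function names are all hypothetical, and following mapping links transitively is an assumption of this sketch rather than a stated policy of either service.

```python
from collections import defaultdict, deque

# Hypothetical URL templates; a real resolver would load these from a registry.
URL_TEMPLATES = {
    "dbA": "https://example.org/dbA/{accession}",
    "dbB": "https://example.org/dbB/entry?id={accession}",
}

# Hypothetical equivalence links between identifiers in different datasets.
MAPPINGS = [
    ("dbA:P001", "dbB:X42"),
    ("dbB:X42", "dbC:7"),
]


def resolve(compact_id):
    """Turn 'prefix:accession' into a URL; raises KeyError for unknown prefixes."""
    prefix, accession = compact_id.split(":", 1)
    return URL_TEMPLATES[prefix].format(accession=accession)


def equivalents(compact_id, mappings=MAPPINGS):
    """Return every identifier reachable through the mapping links."""
    graph = defaultdict(set)
    for left, right in mappings:
        graph[left].add(right)
        graph[right].add(left)
    seen, queue = {compact_id}, deque([compact_id])
    while queue:
        for neighbour in graph[queue.popleft()]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen - {compact_id}


print(resolve("dbA:P001"))              # https://example.org/dbA/P001
print(sorted(equivalents("dbA:P001")))  # ['dbB:X42', 'dbC:7']
```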

FAIR Metadata at many levels

Tool that provisioned the dataset

Dataset Collection

Dataset Profile

Data record content

mappings between entities

mappings between datasets

Interface API and Access

Tool using the dataset

Metadata Profiles and Dataset Registration

Governance, Compliance, Release Protocols

Dataset Profile

Data

Diabetic nephropathy (EFO_0000401)

Data

BioSolr (and Elasticsearch)

Search, Index and Linked Data
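
One common way to combine a search index with ontology annotations is to index each record under its annotated term and all of that term's ancestors, so that a query for a broad term also finds records annotated with narrower ones. The sketch below shows the idea with plain dictionaries. It is not a description of how BioSolr or Elasticsearch implement it: only EFO_0000401 comes from the slides, and the `EX_` identifiers, the hierarchy and the documents are invented for illustration.

```python
from collections import defaultdict

# Toy ontology: child term -> parent term. Only EFO_0000401 comes from the
# slides; the EX_ identifiers and the hierarchy are invented for illustration.
PARENT = {
    "EFO_0000401": "EX_KIDNEY_DISEASE",
    "EX_KIDNEY_DISEASE": "EX_DISEASE",
}

# Toy documents, each annotated with one ontology term.
DOCUMENTS = {
    "doc1": {"EFO_0000401"},
    "doc2": {"EX_KIDNEY_DISEASE"},
    "doc3": {"EX_DISEASE"},
}


def ancestors(term):
    """Walk up the toy hierarchy and return all ancestors of a term."""
    found = []
    while term in PARENT:
        term = PARENT[term]
        found.append(term)
    return found


def build_index(documents):
    """Index each document under its annotated terms and all their ancestors."""
    index = defaultdict(set)
    for doc_id, terms in documents.items():
        for term in terms:
            for key in [term, *ancestors(term)]:
                index[key].add(doc_id)
    return index


index = build_index(DOCUMENTS)
print(sorted(index["EX_KIDNEY_DISEASE"]))  # ['doc1', 'doc2']
```

Expanding at index time keeps queries cheap; the trade-off is that the index must be rebuilt when the ontology changes.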

Biological knowledge bases

Curated and annotated biological entities and their relationships

UniProt, Ensembl, ChEMBL, Orphanet

Two tiers of data repository

data records are dynamic and incomplete

records update, diverge, merge over time, interpretation changes

identifier resolution varies over time; relationships between records are unstable

“reproducibility” potentially compromised

a novel gene-rare disease relationship is reported

consequences of a single nucleotide change in a regulatory genomic region are better understood.

Legacy of Open PHACTS. Mappings are first class.

Data record content

mappings between entities

linksets

provenance, versioning, mapping linksets

VoID – Vocabulary of Interlinked Datasets

• Create description of a Linkset that connects two datasets.

• Select datasets from existing descriptions.

• Capture link predicate and justification
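
The three bullets above can be sketched as a small function that emits a Turtle description of a linkset. The `void:Linkset`, `void:subjectsTarget`, `void:objectsTarget` and `void:linkPredicate` terms are from the VoID vocabulary; the `ex:justification` property, the `example.org` URIs and the function name are assumptions made for this sketch, since the slides do not say which predicate records the justification.

```python
VOID = "http://rdfs.org/ns/void#"
EX = "https://example.org/ns#"  # hypothetical namespace for this sketch


def describe_linkset(linkset_uri, subjects_dataset, objects_dataset,
                     link_predicate, justification):
    """Return a Turtle description of a linkset connecting two datasets."""
    lines = [
        f"@prefix void: <{VOID}> .",
        f"@prefix ex: <{EX}> .",
        "",
        f"<{linkset_uri}> a void:Linkset ;",
        f"    void:subjectsTarget <{subjects_dataset}> ;",
        f"    void:objectsTarget <{objects_dataset}> ;",
        f"    void:linkPredicate <{link_predicate}> ;",
        # ex:justification is a placeholder, not a VoID term.
        f"    ex:justification <{justification}> .",
    ]
    return "\n".join(lines)


turtle = describe_linkset(
    linkset_uri="https://example.org/linksets/a-to-b",
    subjects_dataset="https://example.org/datasets/a",
    objects_dataset="https://example.org/datasets/b",
    link_predicate="http://www.w3.org/2004/02/skos/core#exactMatch",
    justification="https://example.org/justifications/same-compound",
)
print(turtle)
```

Because the two datasets are referenced by URI, they can be selected from descriptions that already exist rather than redescribed for each linkset.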

Legacy of Open PHACTS.

Releasing Data Sets: Software-Like Research Objects

Linked Data Manifests

“Publishing data the software way”

Controlled data Distribution

Containers

Builds

Dependencies

Versioning

Verification

data-maven-plugin

Docker
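
As a rough illustration of "publishing data the software way", the sketch below builds a release manifest that records a version, dependencies and a checksum per file, then verifies a copy of the data against it. It does not reflect how data-maven-plugin or Docker work internally; the manifest layout, file names, dependency string and function names are all hypothetical.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of some bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


def build_manifest(name, version, files, dependencies=()):
    """Describe a data release: version, dependencies, and a checksum per file."""
    return {
        "name": name,
        "version": version,
        "dependencies": list(dependencies),
        "files": {path: sha256_hex(content) for path, content in files.items()},
    }


def verify(manifest, files):
    """Return the paths whose content is missing or no longer matches the manifest."""
    return sorted(
        path
        for path, expected in manifest["files"].items()
        if path not in files or sha256_hex(files[path]) != expected
    )


# Toy release held in memory; a real release would read files from disk.
release = {"data.ttl": b"<a> <b> <c> .\n", "linkset.ttl": b"<a> <same> <x> .\n"}
manifest = build_manifest(
    "example-dataset", "1.0.0", release, dependencies=["example-ontology:2.3.0"]
)

tampered = dict(release, **{"data.ttl": b"<a> <b> <d> .\n"})
print(verify(manifest, release))   # []
print(verify(manifest, tampered))  # ['data.ttl']
```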

Genotype-Phenotype

Deans AR, Lewis SE, Huala E, Anzaldo SS, Ashburner M, et al. (2015) Finding Our Way through Phenotypes. PLoS Biol 13(1): e1002033. doi:10.1371/journal.pbio.1002033

http://journals.plos.org/plosbiology/article?id=info:doi/10.1371/journal.pbio.1002033

Mapping terms

Cross linking datasets

Tracking provenance

Linked Data Services

Publishing FAIR Data

Interoperating Applications

Interoperability Backbone

Interoperability Services Backbone

Linked Data – Big Picture

• lower the barriers to linking data

• connect related data that wasn't previously linked

• self-describe and annotate data in a common, machine readable form

• expose linking as a first class information element

“a term used to describe a recommended best practice for exposing, sharing, and connecting pieces of data, information, and knowledge on the Semantic Web using URIs and RDF.” Wikipedia

Impact of Open PHACTS on ELIXIR Linked Data

Components & Know-how

• Identifiers & Links

• Annotation & Ontologies

• Dataset Containers

• Integrate into off the shelf apps

Publishing and Consuming

• Metadata & Mappings

• On boarding & Release pipelines

• APIs, Search

Data … when it supports interoperability … retain native forms … preparation and maintenance … data governance …

Challenges of Linked Data

Getting data providers to generate LOD

Getting agreement on URIs

Choosing ontologies and relations

Modelling challenges (data vs biological reality)

Appropriate Extract/Load/Transform pipelines

Appropriate representation for datatypes

Getting machine readable dataset descriptions

Expertise in the community to effectively produce/consume LD

Services for finding and reusing URIs & ontologies

Data annotation services (mapping data to ontologies)

Provide an API

Link resources to ontology terms

SPARQL fetish

[Mons]

Human data: The European Genome-phenome Archive (EGA)
