European Life Sciences Infrastructure for Biological Information
www.elixir-europe.org
The ELIXIR of Linked Data
Professor Carole Goble (UK node)
Barend Mons (NL node), Helen Parkinson (EMBL-EBI node)
The Interoperability Services Backbone Team
What is ELIXIR?
An international distributed infrastructure for life-science information
orchestrate the collection, quality control and archiving of biological data produced
by life science experiments.
integrate research data
ensure a seamless service provision that is easily accessible to all.
http://www.elixir-europe.org/about
ELIXIR: An international distributed infrastructure for biological data
Hub
major bioinformatics service providers (~130)
16 ELIXIR members
4 observers
Drivers: Infrastructure Providers
COordinated Research Infrastructures Building Enduring Life-science Services
Marine metagenomics
Human data
Crop and forest plants
Rare diseases
Rare diseases
Genomic data
(WES, WGS)
Other omics data
(transcriptomics,
metabolomics,
proteomics …)
Sample data
(biobank
databases)
Clinical data
(registries, and
phenotypic databases)
1000 exomes
+ > 2500 from other projects
Drug prioritization for Huntington’s Disease
Katerina Nosikova, Elizaveta Besedina, Eelke van der Horst, Peter-Bram ‘t Hoen, Marco Roos, Eleni Mina, Human Genetics department, LUMC, NL
1) Select genes by phenotype matching in Monarch
2) Select drug compounds in Open PHACTS
3) Filter on feasibility for treating HD
4) Prioritized drug compounds
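The four-step flow above (select genes by phenotype matching, select compounds, filter on feasibility, prioritize) can be sketched as a small pipeline. This is only an illustration under stated assumptions: the dictionaries are hypothetical stand-ins for Monarch and Open PHACTS lookups, every gene, phenotype and compound name is a placeholder, and ranking by number of targeted genes is an assumed scoring rule, not the authors' method.

```python
from typing import Dict, List, Set

# Hypothetical toy data; none of these names or associations are real.
GENE_PHENOTYPES: Dict[str, Set[str]] = {      # stand-in for a Monarch lookup
    "GENE_A": {"phenotype_1", "phenotype_2"},
    "GENE_B": {"phenotype_1"},
    "GENE_C": {"phenotype_3"},
}
TARGET_COMPOUNDS: Dict[str, List[str]] = {    # stand-in for an Open PHACTS lookup
    "GENE_A": ["compound_1", "compound_2"],
    "GENE_B": ["compound_2", "compound_3"],
}
FEASIBLE_FOR_HD: Set[str] = {"compound_1", "compound_2"}  # assumed feasibility filter


def select_genes(disease_phenotypes: Set[str]) -> List[str]:
    """Step 1: keep genes that share at least one phenotype with the disease."""
    return sorted(
        gene for gene, phenotypes in GENE_PHENOTYPES.items()
        if phenotypes & disease_phenotypes
    )


def select_compounds(genes: List[str]) -> Dict[str, Set[str]]:
    """Step 2: map each compound to the selected genes it targets."""
    hits: Dict[str, Set[str]] = {}
    for gene in genes:
        for compound in TARGET_COMPOUNDS.get(gene, []):
            hits.setdefault(compound, set()).add(gene)
    return hits


def prioritize(disease_phenotypes: Set[str]) -> List[str]:
    """Steps 3 and 4: filter on feasibility, then rank by number of targeted genes."""
    hits = select_compounds(select_genes(disease_phenotypes))
    feasible = {c: genes for c, genes in hits.items() if c in FEASIBLE_FOR_HD}
    # Most supporting genes first; ties broken alphabetically for determinism.
    return sorted(feasible, key=lambda c: (-len(feasible[c]), c))


ranked = prioritize({"phenotype_1", "phenotype_2"})
print(ranked)
```

In a real setting each lookup would be a call to the corresponding service, and the scoring rule would need to be agreed with domain experts.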
Technical platforms
Data: Secure and deliver core data resources
Tools: Discoverable tools, services and connectors for data access and exploitation
Compute: Robust technical platforms and clouds for secure data access, data exchange and compute
Training: Training programme for professionals, bridging the computational biology skills gap
Standards: Data management, reuse and integration
Findable Accessible Interoperable Reusable
Training: BYODs, data wrangling, governance and quality assurance
Linked Data experts, data experts from MycoBase and Human Protein Atlas
http://www.macs.hw.ac.uk/~ajg33/first-byod-workshop/
Tomato genome, phenotypic observations, variants
Indicators (diagram labels): Impact; Scientific focus; Scientific impact; Community; Legal & funding infrastructure; Quality
Data: Basket of indicators, reflecting the multiple facets of bioinformatics resources
1) Scientific focus and quality of science, e.g. curation effort, benchmarking
2) Community served by the resource, e.g. web statistics
3) Quality of service, e.g. uptime, user support and training
4) Legal and funding infrastructure, e.g. institutional support, use policy
5) Impact and translational stories
Mandatory and optional
Compute Platform: Authentication, Archiving and Movement
Tools Interoperability and APIs
Describing Tools
EDAM Ontology
Describing Workflows
Common format for bioinformatics tool execution
http://commonwl.org/
Rich: Linked Data allows for infinite metadata annotations and reasoning
SWAGGER.json
Describing APIs
API changes: Semantic versioning
Getting resources to have APIs
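As a rough illustration of how semantic versioning communicates API changes, the sketch below classifies a version bump and checks whether a client built against one version can expect to work with a server at another. It assumes plain `MAJOR.MINOR.PATCH` strings with no pre-release or build suffixes; the function names are hypothetical and not part of any ELIXIR service.

```python
from typing import Tuple

Version = Tuple[int, int, int]


def parse_version(version: str) -> Version:
    """Parse 'MAJOR.MINOR.PATCH' (no pre-release/build metadata) into integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def classify_change(old: str, new: str) -> str:
    """Return 'major', 'minor' or 'patch' for the step from old to new."""
    o, n = parse_version(old), parse_version(new)
    if n <= o:
        raise ValueError(f"{new} is not newer than {old}")
    if n[0] != o[0]:
        return "major"   # breaking change: clients may need to be updated
    if n[1] != o[1]:
        return "minor"   # backwards-compatible additions
    return "patch"       # backwards-compatible fixes


def is_compatible(client_built_for: str, server_version: str) -> bool:
    """Same major version, and the server is at least as new as the client expects."""
    c, s = parse_version(client_built_for), parse_version(server_version)
    return s[0] == c[0] and s[1:] >= c[1:]


print(classify_change("1.4.2", "2.0.0"))
print(is_compatible("1.2.0", "1.3.1"))
```

The compatibility rule is the conventional reading of semantic versioning; a given API may of course document stricter or looser guarantees.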
[Luiz Olavo Bonino, DTL] RD-CONNECT, ODEXA4ALL
A FAIRifying Architecture
Warehouses
Preparing Sources
On boarding
Datasets, Content, API
Access from Integrating Frameworks
Interoperability Services: Identifiers, Ontologies, Schemas.
API
FAIR Interoperability Backbone Services
Prepare for interop
• Various species: maize, pine, potato…
• Various data types: from genomes (sequences and annotations) to phenomes (traits)
• Various ontologies: Crop Ontology, Plant Ontology…
• Emerging standards: MIAPPE (Minimum Information on Plant Phenotyping Experiment)
Need for infrastructure
o Manage identifiers
o Register/access services and data sets
o Metadata driven search
© Paul Kersey
Crop and forest plants
Ontology Services
Ontology mapping
Data-Ontology Tools
OLS3
Identifiers – the pivot of everything!
Identifier Mapping Service (IMS)
Identifier Resolution Service (IRS2)
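To make the two identifier services concrete, here is a minimal sketch of what resolution and mapping might look like. It is not the IMS or IRS2 implementation: the `https://identifiers.org/prefix:accession` URL form is used as an assumed resolution pattern, the `IdentifierMapper` class is hypothetical, and the `dbA`, `dbB` and `dbC` identifiers are placeholders.

```python
from typing import Dict, List, Tuple

RESOLVER_BASE = "https://identifiers.org/"  # assumed resolver URL pattern


def split_compact_id(compact_id: str) -> Tuple[str, str]:
    """Split 'prefix:accession' into its two parts, rejecting malformed input."""
    prefix, sep, accession = compact_id.partition(":")
    if not sep or not prefix or not accession:
        raise ValueError(f"not a compact identifier: {compact_id!r}")
    return prefix, accession


def resolve(compact_id: str) -> str:
    """Resolution: turn a compact identifier into a resolvable URL."""
    prefix, accession = split_compact_id(compact_id)
    return f"{RESOLVER_BASE}{prefix}:{accession}"


class IdentifierMapper:
    """Mapping: group identifiers asserted to denote the same entity (union-find)."""

    def __init__(self) -> None:
        self._parent: Dict[str, str] = {}

    def _find(self, identifier: str) -> str:
        self._parent.setdefault(identifier, identifier)
        while self._parent[identifier] != identifier:
            # Path halving keeps later lookups short.
            self._parent[identifier] = self._parent[self._parent[identifier]]
            identifier = self._parent[identifier]
        return identifier

    def add_mapping(self, left: str, right: str) -> None:
        root_left, root_right = self._find(left), self._find(right)
        if root_left != root_right:
            self._parent[root_right] = root_left

    def equivalents(self, identifier: str) -> List[str]:
        root = self._find(identifier)
        return sorted(i for i in list(self._parent) if self._find(i) == root)


mapper = IdentifierMapper()
mapper.add_mapping("dbA:001", "dbB:X1")   # placeholder identifiers
mapper.add_mapping("dbB:X1", "dbC:77")
print(mapper.equivalents("dbC:77"))
print(resolve("uniprot:P12345"))
```

Treating mappings as transitive equivalence is a simplification; a production mapping service would typically also record the predicate and justification behind each link.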
FAIR Metadata at many levels
Tool that provisioned the dataset
Dataset Collection
Dataset Profile
Data record content
mappings between entities
mappings between datasets
Interface API and Access
Tool using the dataset
Metadata Profiles and Dataset Registration
Governance, Compliance, Release Protocols
Dataset Profile
Data
Diabetic nephropathy (EFO_0000401)
Data
BioSolr
(and Elasticsearch)
Search, Index and Linked Data
Biological knowledge bases
Curated and annotated biological entities and their relationships
UniProt, Ensembl, ChEMBL, Orphanet
Two tiers of data repository
data records are dynamic and incomplete
records update, diverge, merge over time; interpretation changes
identifier resolution varies over time: relationships between records are unstable
“reproducibility” potentially compromised
a novel gene-rare disease relationship is reported
consequences of a single nucleotide change in a regulatory genomic region are better understood.
Legacy of Open PHACTS. Mappings are first class.
Data record content
mappings between entities
linksets
provenance, versioning, mapping linksets
VoID – Vocabulary of Interlinked Datasets
• Create description of a Linkset that connects two datasets.
• Select datasets from existing descriptions.
• Capture link predicate and justification
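The three steps above (describe a linkset between two datasets, pick the datasets, capture the link predicate and justification) could produce a description along the following lines. This is a hedged sketch: `void:Linkset`, `void:subjectsTarget`, `void:objectsTarget` and `void:linkPredicate` are VoID terms, but the `ex:justification` property, the `Linkset` class and all `example.org` URIs are assumptions made for illustration, and the Turtle is assembled by hand rather than with an RDF library.

```python
from dataclasses import dataclass

PREFIXES = {
    "void": "http://rdfs.org/ns/void#",
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "ex": "http://example.org/vocab/",  # hypothetical namespace for the justification
}


@dataclass(frozen=True)
class Linkset:
    uri: str
    subjects_target: str   # URI of the dataset the link subjects come from
    objects_target: str    # URI of the dataset the link objects come from
    link_predicate: str    # prefixed name, e.g. "skos:exactMatch"
    justification: str     # free text; ex:justification is a placeholder property

    def to_turtle(self) -> str:
        # Escape backslashes and quotes so the literal stays valid Turtle.
        text = self.justification.replace("\\", "\\\\").replace('"', '\\"')
        lines = [f"@prefix {p}: <{ns}> ." for p, ns in PREFIXES.items()]
        lines += [
            "",
            f"<{self.uri}> a void:Linkset ;",
            f"    void:subjectsTarget <{self.subjects_target}> ;",
            f"    void:objectsTarget <{self.objects_target}> ;",
            f"    void:linkPredicate {self.link_predicate} ;",
            f'    ex:justification "{text}" .',
        ]
        return "\n".join(lines)


linkset = Linkset(
    uri="http://example.org/linksets/a-to-b",
    subjects_target="http://example.org/datasets/a",
    objects_target="http://example.org/datasets/b",
    link_predicate="skos:exactMatch",
    justification="Records describe the same chemical structure",
)
turtle = linkset.to_turtle()
print(turtle)
```

Keeping the justification alongside the predicate is what lets a consumer decide which links are appropriate for a given question.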
Legacy of Open PHACTS.
Releasing Data Sets: Software-Like Research Objects
Linked Data Manifests
“Publishing data the software way”
Controlled data Distribution
Containers
Builds
Dependencies
Versioning
Verification
data-maven-plugin
Docker
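One way to picture "publishing data the software way" is a release manifest that records a version, dependencies and per-file checksums, so that a consumer can verify what they received. The sketch below is an assumption-laden illustration: the manifest layout and function names are hypothetical and are not the format used by data-maven-plugin or by any Open PHACTS release, and file contents are passed in memory to keep the example self-contained.

```python
import hashlib
from typing import Dict, List, Optional


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def build_manifest(
    name: str,
    version: str,
    files: Dict[str, bytes],
    dependencies: Optional[Dict[str, str]] = None,
) -> dict:
    """Describe a data release: identity, version, dependencies and checksums."""
    return {
        "name": name,
        "version": version,
        "dependencies": dict(dependencies or {}),
        "files": {path: sha256_hex(content) for path, content in sorted(files.items())},
    }


def verify(manifest: dict, files: Dict[str, bytes]) -> List[str]:
    """Return a list of problems; an empty list means the release verifies."""
    problems: List[str] = []
    for path, expected in manifest["files"].items():
        content = files.get(path)
        if content is None:
            problems.append(f"missing: {path}")
        elif sha256_hex(content) != expected:
            problems.append(f"checksum mismatch: {path}")
    return problems


release = {"data.ttl": b"<a> <b> <c> .\n", "README.txt": b"hypothetical release\n"}
manifest = build_manifest("example-dataset", "1.0.0", release, {"example-ontology": "2.3.0"})

tampered = dict(release, **{"data.ttl": b"<a> <b> <d> .\n"})
print(verify(manifest, release))
print(verify(manifest, tampered))
```

Pinning dependency versions in the same manifest is what makes a data release reproducible in the way a software build is.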
Genotype-Phenotype
Deans AR, Lewis SE, Huala E, Anzaldo SS, Ashburner M, et al. (2015) Finding Our Way through Phenotypes. PLoS Biol 13(1): e1002033. doi:10.1371/journal.pbio.1002033
http://journals.plos.org/plosbiology/article?id=info:doi/10.1371/journal.pbio.1002033
Mapping terms
Cross linking datasets
Tracking provenance
Linked Data Services
Publishing FAIR Data
Interoperating Applications
Interoperability Backbone
Interoperability Services Backbone
Linked Data – Big Picture
• lower the barriers to linking data
• connect related data that wasn't previously linked
• self-describe and annotate data in a common, machine readable form
• expose linking as a first class information element
“a term used to describe a recommended best practice for exposing, sharing, and connecting pieces of data, information, and knowledge on the Semantic Web using URIs and RDF.” (Wikipedia)
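A tiny in-memory example may help show why shared URIs connect data that was not previously linked. The sketch assumes two hypothetical datasets under `example.org` that both annotate records with the same ontology term URI; the EFO term is the one mentioned elsewhere in these slides, while the predicates and record URIs are placeholders. Triples are plain Python tuples rather than a real RDF store.

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

EFO_TERM = "http://www.ebi.ac.uk/efo/EFO_0000401"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

# Two hypothetical datasets that were published independently.
DATASET_A: List[Triple] = [
    ("http://example.org/a/sample/1", "http://example.org/vocab/hasDisease", EFO_TERM),
]
DATASET_B: List[Triple] = [
    ("http://example.org/b/assay/9", "http://example.org/vocab/studies", EFO_TERM),
    (EFO_TERM, RDFS_LABEL, "diabetic nephropathy"),
]


def match(
    triples: List[Triple],
    s: Optional[str] = None,
    p: Optional[str] = None,
    o: Optional[str] = None,
) -> List[Triple]:
    """Return triples matching the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# Because both datasets use the same URI, merging is simple concatenation.
merged = DATASET_A + DATASET_B
linked_records = sorted(t[0] for t in match(merged, o=EFO_TERM))
print(linked_records)
```

No schema alignment was needed to find both records: the shared identifier carries the link, which is the point of exposing linking as a first class element.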
Impact of Open PHACTS on ELIXIR Linked Data
Components & Know-how
• Identifiers & Links
• Annotation & Ontologies
• Dataset Containers
• Integrate into off the shelf apps
Publishing and Consuming
• Metadata & Mappings
• On boarding & Release pipelines
• APIs, Search
Data … when it supports interoperability … retain native forms … preparation and maintenance … data governance …
Challenges of Linked Data
Getting data providers to generate LOD
Getting agreement on URIs
Choosing ontologies and relations
Modelling challenges (data vs biological reality)
Appropriate Extract/Load/Transform pipelines
Appropriate representation for datatypes
Getting machine readable dataset descriptions
Expertise in the community to effectively produce/consume LD
Services for finding and reusing URIs & ontologies
Data annotation services (mapping data to ontologies)
Provide an API
Link resources to ontology terms
SPARQL fetish
[Mons]
Human data: The European Genome-phenome Archive EGA