RDFizing the EBI Gene Expression Atlas
James Malone, Electra Tapanari
Motivation
- Initial motivation is explorative
- Can we ask new questions?
- Do we get new answers?
- Can we integrate this data with other related data?
- Is there a sufficient user community to justify an RDF Atlas resource?
SESL Project
- Semantic Enrichment of Scientific Literature Working Group
- Includes EBI (Dietrich Rebholz-Schuhmann) and Pistoia Alliance
- Pilot project in 2010 on developing knowledge-brokering standards for the semantic integration of gene to Type II diabetes data, using Gene Expression Atlas, OMIM, UniProt and literature
Gene Expression: Archive to Atlas
[Diagram: from archive to atlas]
• ArrayExpress: AE/GEO acquire >250,000 assays, >10,000 experiments
• Atlas: re-annotate & summarize
• Curation
Experimental Factor Ontology
• We consume parts of reference ontologies from the domain
• Construct new classes and relations to answer our use cases
• Aim is reuse of existing resources, shared frameworks and mapping of equivalencies where they exist
[Diagram: inputs to EFO]
• Disease Ontology
• Anatomy Reference Ontology
• Ontology for Biomedical Investigations
• Chemical Entities of Biological Interest (ChEBI)
• Various Species Anatomy Ontologies
• Relation Ontology
• Text mining
Gene Expression Atlas @ www.ebi.ac.uk/gxa
Query for Cell adhesion genes in all ‘organism parts’
‘View on EFO’
Ontologically Modeling Sample Variables in Gene Expression Data [email protected]
Input XML
Mapping XML Results to RDF (1)
The ID here is an Ensembl gene ID, e.g. RUNX1 (ENSG00000159216)
• Gene to related transcripts, sequence and gene functions
• Also EFO ontology classes in RDF form (shown is the label-to-IRI triple)
Mapping XML Results to RDF (2)
• Connecting gene and ontology id together with experimental metrics
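One way to connect a gene, an ontology class and experimental metrics is to hang all three off a single blank node per result row. The sketch below is illustrative only: the `http://example.org/` IRIs, the property names (`gene`, `factorValue`, `pValue`) and the EFO identifier `EFO_0000000` are placeholders, not the vocabulary or data used in the Atlas RDF.

```python
XSD_DOUBLE = "http://www.w3.org/2001/XMLSchema#double"
EX = "http://example.org/atlas/"  # hypothetical vocabulary namespace


def row_to_triples(row_id, gene_iri, efo_iri, metrics):
    """Return N-Triples lines that tie a gene, an EFO class and metrics
    to one blank node representing a single result row."""
    node = "_:n%s" % row_id
    triples = [
        (node, "<%sgene>" % EX, "<%s>" % gene_iri),
        (node, "<%sfactorValue>" % EX, "<%s>" % efo_iri),
    ]
    # One typed literal per metric; sorted so the output order is stable.
    for name, value in sorted(metrics.items()):
        literal = '"%s"^^<%s>' % (value, XSD_DOUBLE)
        triples.append((node, "<%s%s>" % (EX, name), literal))
    return [" ".join(t) + " ." for t in triples]


lines = row_to_triples(
    "1_0",
    "http://example.org/gene/ENSG00000159216",  # placeholder gene IRI
    "http://example.org/efo/EFO_0000000",       # placeholder EFO class IRI
    {"pValue": 0.01},                           # placeholder metric
)
for line in lines:
    print(line)
```

The blank node acts as an n-ary relation: the gene and the ontology class are never linked directly, so the metrics stay attached to the specific gene/factor pairing they describe.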
Mapping XML Results to RDF (3)
• Connecting gene with experimental metadata
Relationship Issues
• EFO attempts to follow OBO Foundry guidance and uses the OBO Relation Ontology
• OBI model is more complex, e.g. the relation between sample and measure is indirect*
• Relationships between some entities are still not well represented across the community, even protein product to gene (see my post to the OBO list)
• is_about relation is very generic and largely meaningless
• We will use RO where possible, subclass RO otherwise and continue to monitor OBO
*See Brinkman et al. (2010) Modeling biomedical experimental processes with OBI, J Biomed Semantics, 1(Suppl 1):S7
Display of query results in Gene Expression Atlas DB
Already:
1) JSON format
2) XML format
Plus now:
3) RDF format
RDF pipeline
• Pipeline for generating the RDF given the XML input
• Note: this works with any XML document
[Diagram: pipeline stages]
• INPUT: XML result doc from Atlas; XML doc with triple patterns
• PROCESS: Java code
• OUTPUT: RDF triples
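The pipeline idea (an XML result document plus an XML document of triple patterns, turned into RDF triples) can be sketched as below. This is a minimal Python illustration, not the Java implementation from the talk. The element names (`results`, `result`, `gene`, `efo`), the `{path@attr}` placeholder syntax, the `example.org` IRIs and the `EFO_0000000` identifier are all assumptions made up for the example.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical result document; the real Atlas XML layout is not shown here.
RESULT_XML = """
<results>
  <result>
    <gene id="ENSG00000159216" name="RUNX1"/>
    <efo id="EFO_0000000"/>
  </result>
</results>
"""

# Hypothetical triple-pattern document. {path@attr} is filled from each row.
PATTERN_XML = """
<patterns row="result">
  <triple>
    <subject iri="http://example.org/gene/{gene@id}"/>
    <predicate iri="http://www.w3.org/2000/01/rdf-schema#label"/>
    <object literal="{gene@name}"/>
  </triple>
  <triple>
    <subject iri="http://example.org/gene/{gene@id}"/>
    <predicate iri="http://example.org/atlas/hasFactor"/>
    <object iri="http://example.org/efo/{efo@id}"/>
  </triple>
</patterns>
"""

PLACEHOLDER = re.compile(r"\{([^{}@]+)(?:@([^{}]+))?\}")


def fill(template, row):
    """Replace each {path@attr} (or {path}) with a value taken from the row."""
    def lookup(match):
        path, attr = match.group(1), match.group(2)
        node = row.find(path)
        if node is None:
            raise KeyError("no element %r in row" % path)
        value = node.get(attr) if attr else (node.text or "").strip()
        if value is None:
            raise KeyError("no attribute %r on %r" % (attr, path))
        return value
    return PLACEHOLDER.sub(lookup, template)


def escape_literal(text):
    """Escape the characters that must be escaped inside an N-Triples literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def xml_to_ntriples(result_xml, pattern_xml):
    """Apply every triple pattern to every row element; return N-Triples lines."""
    rows = ET.fromstring(result_xml)
    spec = ET.fromstring(pattern_xml)
    lines = []
    for row in rows.iter(spec.get("row")):
        for pattern in spec.findall("triple"):
            subject = fill(pattern.find("subject").get("iri"), row)
            predicate = pattern.find("predicate").get("iri")
            obj = pattern.find("object")
            if obj.get("iri") is not None:
                obj_term = "<%s>" % fill(obj.get("iri"), row)
            else:
                obj_term = '"%s"' % escape_literal(fill(obj.get("literal"), row))
            lines.append("<%s> <%s> %s ." % (subject, predicate, obj_term))
    return lines


out = xml_to_ntriples(RESULT_XML, PATTERN_XML)
for line in out:
    print(line)
```

Because the patterns live in their own document, the same code can be pointed at a different XML source by swapping the pattern file, which is the property the slide highlights.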
Triple Pattern specification
Example RDF
Blank Node Connections
• First row (n1_0): 7 triples
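A per-row triple count like the one above can be checked by grouping serialized triples by their blank-node subject. The sketch below assumes the output is available as N-Triples lines; the sample data and the `example.org` IRIs are invented for illustration and do not reproduce the seven triples from the slide.

```python
from collections import Counter


def triples_per_blank_node(ntriples_lines):
    """Count how many triples have each blank node as their subject."""
    counts = Counter()
    for line in ntriples_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        subject = line.split(None, 1)[0]
        if subject.startswith("_:"):
            counts[subject[2:]] += 1
    return counts


# Invented sample: two triples on n1_0, one on n1_1, one with an IRI subject.
sample = [
    "_:n1_0 <http://example.org/atlas/gene> <http://example.org/gene/ENSG00000159216> .",
    '_:n1_0 <http://example.org/atlas/pValue> "0.01" .',
    "_:n1_1 <http://example.org/atlas/gene> <http://example.org/gene/ENSG00000159216> .",
    '<http://example.org/gene/ENSG00000159216> <http://www.w3.org/2000/01/rdf-schema#label> "RUNX1" .',
]
counts = triples_per_blank_node(sample)
print(dict(counts))
```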
Discussion
• Is there a community that warrants directing resources towards this?
• Can we answer new questions?
• Can we integrate with other data sources?
• Can we consolidate complex, non-interoperable ontologies?
• EFO represents a view on this but is a scoped, pragmatic choice; will this always be the case?
Acknowledgements
• Electra Tapanari (intern who did the bulk of the implementation)
• Dietrich Rebholz-Schuhmann (funded the internship)
• Christoph Grabmuller
• Misha Kapushesky
• Helen Parkinson
• Contact me
James Malone: [email protected]