Ontologically Modeling Sample Variables in Gene Expression Data
James [email protected], Cambridge, UK
Overview
• Application background
• Motivation for ontologies – questions we want to answer
• Methodology
• Ontology and application
• Future work/things we’d like to do
Ontologically Modeling Sample Variables in Gene Expression Data [email protected]
Gene Expression: Archive to Atlas
[Diagram: AE/GEO acquire → ArrayExpress (>250,000 assays, >10,000 experiments) → curation → re-annotate & summarize → Atlas]
Gene Expression Sample Variable Annotations
Annotations                    Archive    Atlas
Species                            330        9
Samples                        238,000   34,650
Annotations on samples         860,700  101,830
Unique sample annotations       37,500    6,600
Assays (hybridizations)        246,000   30,000
Annotations on assays          569,700   67,000
Unique assay annotations        25,000    4,000
Use Cases
• Query support (e.g., query for 'cancer' and also get 'leukemia')
• Data visualisation – e.g., presenting an ontology tree to the user of what is in the database
• Data integration by ontology terms – e.g., we assume that 'kidney' in independent studies means roughly the same thing, so we can count how many kidney samples we have in the database
• Intelligent template generation for different experiment types in submission or data presentation
• Summary-level data
• Nonsense detection – e.g., telling us that something marked as cancer cannot also be marked as healthy
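Two of the use cases above (query support and nonsense detection) can be sketched in a few lines of Python. This is a minimal illustration only: the toy hierarchy, the disjointness pair and the sample annotations are assumptions made for the example and are not taken from EFO.

```python
# Toy is-a hierarchy (child -> parent); terms are illustrative, not taken from EFO.
PARENT = {
    "leukemia": "cancer",
    "adenocarcinoma": "cancer",
    "cancer": "disease",
    "healthy": "disease state",
}

# Pairs of terms assumed to be mutually exclusive (a hypothetical disjointness axiom).
DISJOINT = {frozenset({"cancer", "healthy"})}


def ancestors(term):
    """Return the term itself plus all of its ancestors."""
    found = [term]
    while term in PARENT:
        term = PARENT[term]
        found.append(term)
    return found


def matches(query, annotation):
    """True if the annotation is the query term or one of its descendants."""
    return query in ancestors(annotation)


def is_nonsense(annotations):
    """True if two annotations on one sample fall under a pair of disjoint terms."""
    expanded = [set(ancestors(a)) for a in annotations]
    for i, left in enumerate(expanded):
        for right in expanded[i + 1:]:
            for pair in DISJOINT:
                a, b = tuple(pair)
                if (a in left and b in right) or (b in left and a in right):
                    return True
    return False


# Hypothetical sample annotations.
samples = {"s1": ["leukemia"], "s2": ["healthy"], "s3": ["leukemia", "healthy"]}
cancer_hits = sorted(s for s, anns in samples.items()
                     if any(matches("cancer", a) for a in anns))
flagged = sorted(s for s, anns in samples.items() if is_nonsense(anns))
```

Here a query for 'cancer' also returns samples annotated only with 'leukemia', and the sample carrying both a cancer subtype and 'healthy' is flagged for review.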
Questions we want to answer
• Diverse nature of annotations on data
• Need to support complex queries which contain semantic information
• E.g., which genes are under-expressed in brain samples in human or mouse?
• If we annotate with the terms below, do we get this data?
[Diagram labels: cancer, adenocarcinoma]
Primary Question: Where to place our semantics?
[Diagram labels: cancer, adenocarcinoma, Atlas/AE]
Decoupling knowledge from data
Atlas/AE
Methodology: Reference vs Application Ontology
• There is debate in the community about the difference; here is our thesis
• A reference ontology describes a knowledge space: an explicitly delineated part of a domain.
[Diagram labels: Cell type, Human Anatomy, GO Process, Biomedicine]
Methodology: Reference vs Application Ontology
• An application ontology describes an application or data space: an explicitly delineated part of a domain.
• It should consume reference ontologies to meet application needs
[Diagram labels: Cell type, Human Anatomy, GO Process, Biomedicine]
Building the Experimental Factor Ontology
• We consume parts of reference ontologies from the domain
• We construct new classes and relations to answer our use cases
• The aim is reuse of existing resources, shared frameworks and mapping of equivalences where they exist
[Diagram: EFO draws on the Disease Ontology, Anatomy Reference Ontology, Ontology for Biomedical Investigations, Chemical Entities of Biological Interest (ChEBI), various species anatomy ontologies, the Relation Ontology, and text mining]
Identify Upper Level Structure
• We have taken a BFO-lite approach, hiding labels from users for application purposes and sometimes using a different definition
information content entity (IAO)
site (BFO)
material entity (BFO)
processual entity (BFO)
specifically dependent continuant (BFO)
Specifically dependent continuant: A continuant [snap:Continuant] that inheres in or is borne by other entities. Every instance of A requires some specific instance of B which must always be the same.
Material property: A property or characteristic of some other entity. For example, the mouse has the colour white.
Adding New Classes @ www.ebi.ac.uk/efo/tools
• We wish to maximise our interoperability
• Submitters and other groups use many ontologies
• Trade-off: being open to their data and preferences vs imposing a more ordered view on semantics
• Our goal: where orthogonality exists, we aim to import only that class. Where it does not, we create ‘mappings’ in our EFO classes via annotation property references (in a similar way to xrefs)
• E.g. for ChEBI classes, import the ChEBI URI; for ‘cancer’, create an EFO class and add multiple mappings
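The import-or-map policy described above can be sketched as follows. This is an illustration under stated assumptions, not the EFO tooling: the `OntologyClass` type, the `add_term` helper and every URI shown are hypothetical placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class OntologyClass:
    uri: str
    label: str
    # Annotation-style cross references to overlapping classes in other ontologies.
    mappings: list = field(default_factory=list)


def add_term(label, sources, local_namespace="http://example.org/efo/"):
    """Import the class as-is when exactly one source defines it (the orthogonal
    case); otherwise mint a local class and record the sources as mappings."""
    if len(sources) == 1:
        return OntologyClass(uri=sources[0], label=label)
    local_uri = local_namespace + label.replace(" ", "_")
    return OntologyClass(uri=local_uri, label=label, mappings=sorted(sources))


# Placeholder URIs, not real identifiers.
water = add_term("water", ["http://example.org/chebi/WATER"])
cancer = add_term("cancer", ["http://example.org/ncit/CANCER",
                             "http://example.org/do/CANCER"])
```

In this sketch `water` keeps its source URI and carries no mappings, while `cancer` receives a local URI plus one mapping per overlapping source.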
Creating Class Mappings
• For overlapping ontologies, we aim to create a ‘mapping class’
• We use semi-automated text mining with the “double-metaphone” algorithm
• We match our values in the database to ontology class labels and definitions
• We also perform mappings from EFO to other ontologies, so that EFO: cancer = NCI: cancer, DO: cancer, etc.
• Mappings are sanity-checked before being added to the ontology
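The matching step can be illustrated with a phonetic key. Note that the key below is a deliberately crude stand-in and is NOT the Double Metaphone algorithm named on the slide; the function names and the example values are assumptions for illustration. Suggestions produced this way would still need the manual sanity check described above.

```python
import re


def phonetic_key(text):
    """Crude phonetic key: a simplified stand-in, NOT Double Metaphone."""
    s = re.sub(r"[^a-z]", "", text.lower())
    # Normalise a few spelling variants that sound alike.
    for old, new in (("ae", "e"), ("oe", "e"), ("ph", "f"),
                     ("ou", "o"), ("k", "c"), ("z", "s")):
        s = s.replace(old, new)
    if not s:
        return ""
    # Keep the first letter, drop later vowels, and skip a consonant that
    # repeats the previously kept letter.
    key = s[0]
    for ch in s[1:]:
        if ch in "aeiouy":
            continue
        if ch != key[-1]:
            key += ch
    return key


def suggest_mappings(db_values, class_labels):
    """Suggest, for each database value, the class labels sharing its key."""
    index = {}
    for label in class_labels:
        index.setdefault(phonetic_key(label), []).append(label)
    return {value: index.get(phonetic_key(value), []) for value in db_values}


suggestions = suggest_mappings(["leukaemia", "Tumour", "spleen"],
                               ["leukemia", "tumor", "kidney"])
```

British and American spellings collapse to the same key here, so 'leukaemia' is offered 'leukemia' as a candidate, while 'spleen' receives no suggestion.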
Keeping Up To Date with External Classes
• A tool automatically updates metadata at every release (monthly)
• It uses BioPortal web services to access the latest definitions and synonyms
[Diagram label: Class URI/ID]
Detecting Change in External Ontologies
• The Bubastis tool detects axiomatic changes between two ontologies (in our case, two versions of the same ontology)
• @todo: detect annotation property changes
• We also detect missing annotation properties with the Watchman tool (not released yet), presently used mainly for labels
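The idea of comparing two versions of an ontology can be sketched with plain set differences. This is not how Bubastis or Watchman are implemented; it is a toy illustration that assumes each version is available as a mapping from class ID to a set of axiom strings, and all IDs and axioms below are made up.

```python
def diff_versions(old, new):
    """Compare two versions given as {class_id: set of axiom strings}."""
    report = {
        "added_classes": sorted(new.keys() - old.keys()),
        "removed_classes": sorted(old.keys() - new.keys()),
        "changed_classes": {},
    }
    for class_id in sorted(old.keys() & new.keys()):
        added = new[class_id] - old[class_id]
        removed = old[class_id] - new[class_id]
        if added or removed:
            report["changed_classes"][class_id] = {
                "added": sorted(added),
                "removed": sorted(removed),
            }
    return report


def missing_labels(labels, class_ids):
    """Return the class IDs that have no (or an empty) label."""
    return sorted(c for c in class_ids if not labels.get(c))


# Hypothetical versions of an external ontology.
old = {"X:1": {"SubClassOf(X:1 X:0)"}, "X:2": {"SubClassOf(X:2 X:0)"}}
new = {"X:1": {"SubClassOf(X:1 X:9)"}, "X:3": {"SubClassOf(X:3 X:0)"}}
report = diff_versions(old, new)
unlabelled = missing_labels({"X:1": "example term", "X:3": ""}, new.keys())
```

The report separates classes that appeared, classes that disappeared, and classes whose axioms changed, which is the information needed to decide whether an imported class must be revisited.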
Creating Relations and Equivalent Classes
cell line (HeLa)
organism part (cervix)
cell type (epithelial)
disease (cervical adenocarcinoma)
species (human)
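The HeLa example above can be sketched as a record of relations plus a defined class that any matching entity falls under. The relation names (`derives_from_organism_part` and so on) and the defined class are hypothetical: the slide lists only the linked entities, not the relations used in EFO.

```python
# Hypothetical relation names; only the linked values come from the slide.
HELA = {
    "label": "HeLa",
    "type": "cell line",
    "derives_from_organism_part": "cervix",
    "has_cell_type": "epithelial",
    "bearer_of_disease": "cervical adenocarcinoma",
    "derives_from_species": "human",
}

# A defined (equivalent) class expressed as necessary and sufficient conditions.
HUMAN_CERVICAL_CELL_LINE = {
    "type": "cell line",
    "derives_from_organism_part": "cervix",
    "derives_from_species": "human",
}


def satisfies(entity, definition):
    """True if the entity meets every condition of the defined class."""
    return all(entity.get(rel) == value for rel, value in definition.items())


is_member = satisfies(HELA, HUMAN_CERVICAL_CELL_LINE)
```

A reasoner does this classification over the full ontology; the sketch only shows why stating relations explicitly lets a query for 'human cervical cell lines' find HeLa without that phrase ever being annotated.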
Structure for queries
Gene Expression Atlas
• Linking data to the ontology
[Diagram labels: Assay Table, Sample Table, Ontology Term Table, Query, OWL Model, Database formulated query]
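One way to read the diagram is that a query term is first expanded using the ontology model and then turned into a database query over the assay, sample and ontology term tables. The sketch below illustrates that reading with the standard-library `sqlite3` module; the table schema, column names, toy hierarchy and rows are all assumptions and do not reflect the actual Atlas database.

```python
import sqlite3

# Toy is-a hierarchy standing in for the OWL model (parent -> children).
CHILDREN = {
    "cancer": ["leukemia", "adenocarcinoma"],
    "adenocarcinoma": ["cervical adenocarcinoma"],
}


def expand(term):
    """Return the term and all of its descendants."""
    terms, stack = [], [term]
    while stack:
        current = stack.pop()
        terms.append(current)
        stack.extend(CHILDREN.get(current, []))
    return terms


# Hypothetical schema and data held in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ontology_term (term_id INTEGER PRIMARY KEY, label TEXT);
    CREATE TABLE sample (sample_id INTEGER PRIMARY KEY,
                         term_id INTEGER REFERENCES ontology_term(term_id));
    CREATE TABLE assay (assay_id INTEGER PRIMARY KEY,
                        sample_id INTEGER REFERENCES sample(sample_id));
    INSERT INTO ontology_term VALUES (1, 'leukemia'), (2, 'cervical adenocarcinoma'), (3, 'kidney');
    INSERT INTO sample VALUES (10, 1), (11, 2), (12, 3);
    INSERT INTO assay VALUES (100, 10), (101, 11), (102, 12), (103, 10);
""")


def assays_for(term):
    """Formulate the database query from the ontology-expanded term list."""
    labels = expand(term)
    placeholders = ",".join("?" for _ in labels)
    rows = conn.execute(
        f"""SELECT a.assay_id FROM assay a
            JOIN sample s ON s.sample_id = a.sample_id
            JOIN ontology_term t ON t.term_id = s.term_id
            WHERE t.label IN ({placeholders})
            ORDER BY a.assay_id""",
        labels,
    )
    return [row[0] for row in rows]


cancer_assays = assays_for("cancer")
```

Keeping the hierarchy outside the database means the tables only store term references, which matches the earlier point about decoupling knowledge from data.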
Gene Expression Atlas @ www.ebi.ac.uk/gxa
Query for Cell adhesion genes in all ‘organism parts’
‘View on EFO’
ArrayExpress Archive @ www.ebi.ac.uk/arrayexpress
Developing an Ontology from the Application [email protected]
Future Work: Linked Data
Linking data by dereferenceable URI for both humans and machines
http://www.ebi.ac.uk/gxa/Experiment12345
Future Work: RDF Triple Store @ www.ebi.ac.uk/efo/semanticweb/atlas
• Q: Is a SPARQL query against an RDF triple store quicker than a SPARQL query translated into SQL?
[Diagram labels: OWL Ontology, Atlas Data, RDFizer, SPARQL, RDF Triple Store, SQL Translation Layer]
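To make the triple-store side of the question concrete, the sketch below evaluates a conjunction of triple patterns over a tiny in-memory set of triples, which is roughly what a SPARQL basic graph pattern asks for. It is a toy written in plain Python, not a SPARQL engine and not the Atlas RDF; every identifier in it is made up, and it says nothing about relative performance.

```python
# Toy in-memory triple store; all identifiers are made up for illustration.
TRIPLES = {
    ("ex:sample1", "ex:has_disease", "ex:leukemia"),
    ("ex:sample2", "ex:has_disease", "ex:adenocarcinoma"),
    ("ex:sample3", "ex:has_organism_part", "ex:kidney"),
    ("ex:leukemia", "rdfs:subClassOf", "ex:cancer"),
    ("ex:adenocarcinoma", "rdfs:subClassOf", "ex:cancer"),
}


def match(pattern, triples=TRIPLES):
    """Yield variable bindings for one triple pattern ('?name' marks a variable)."""
    for triple in triples:
        binding = {}
        for want, got in zip(pattern, triple):
            if want.startswith("?"):
                # A variable used twice in one pattern must bind consistently.
                if binding.setdefault(want, got) != got:
                    break
            elif want != got:
                break
        else:
            yield binding


def join(patterns, triples=TRIPLES):
    """Evaluate a conjunction of triple patterns by extending partial solutions."""
    solutions = [{}]
    for pattern in patterns:
        next_solutions = []
        for solution in solutions:
            # Substitute variables that are already bound.
            bound = tuple(solution.get(part, part) for part in pattern)
            for binding in match(bound, triples):
                next_solutions.append({**solution, **binding})
        solutions = next_solutions
    return solutions


cancer_samples = sorted(s["?sample"] for s in join([
    ("?sample", "ex:has_disease", "?disease"),
    ("?disease", "rdfs:subClassOf", "ex:cancer"),
]))
```

The same question asked through a SQL translation layer would become joins over relational tables, so the comparison on the slide is between two evaluation strategies for one query.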
Future Work: Data Integration
• Consuming reference ontologies and mapping to multiple ontologies where overlap exists offers us maximum interoperability
• The advantage of triple stores is not yet immediately apparent
• Impetus required: “should we champion this technology?”
[Diagram labels: RDF triples, QUERY, Atlas, SwissProt, Amino Acid Ontology]
Summary
• We have created a sustainable approach to consuming multiple reference ontologies
• Tooling solutions expedite the process
• We consider EFO to be a ‘view’ of such ontologies for our application needs
• The primary aim of this work is to enable novel research with the experimental data we have
• Specifically, we can answer new questions, integrate across our data resources, and visualise and summarise the data
• Our belief is that describing such data should be the driving force behind ontology development
• Future work will look at linked data and RDF triple stores
Acknowledgements
• Ontology creation: James Malone, Tomasz Adamusiak, Ele Holloway, Helen Parkinson, Jie Zheng (U Penn)
• Ontology mapping tools and text mining evaluation: Tim Rayner, Holly Zheng, Margus Lukk
• GUI development: Misha Kapushesky, Pasha Kurnosov, Anna Zhukova, Nikolay Kolesinkov
• External review and anatomy: Jonathan Bard, Jie Zheng
• ArrayExpress production staff
• EBI Rebholz Group (Whatizit text mining tool)
• Many source ontologies for terms and definitions, esp. Disease Ontology, Cell Type Ontology, FMA, NCIT, OBI
• Funders: EC (Gen2Phen, FELICS, MUGEN, EMERALD, ENGAGE, SLING), EMBL, NIH
• Eric Neumann, Joanne Luciano and Alan Ruttenberg
• W3C & HCLS Group: Eric Prud'hommeaux and Scott Marshall
• OBI developers