Ontologies, data standards and controlled vocabularies
Why use standards and CVs?
• Essential in high-throughput biology for sorting through vast amounts of data
• To use the same data labels universally
• To enable quick retrieval of data
• To enable easy comparison of data
• To remove ambiguities
What’s in a name?
• What is a cell?
Ambiguities in naming
• The same name can be used to describe different concepts (e.g. “cell”)
• Different names can be used to describe the same concept, e.g.:
– Glucose synthesis
– Glucose biosynthesis
– Glucose formation
– Glucose anabolism
– Gluconeogenesis
• All refer to the process of making glucose
• This makes it difficult to compare the information
• Solution: use ontologies and data standards
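A minimal sketch of how a controlled vocabulary removes this ambiguity: every free-text label is mapped to one preferred term before data are compared. The choice of "gluconeogenesis" as the preferred term and the `normalise` helper are assumptions made for illustration.

```python
# Preferred term chosen for illustration; the synonyms are the labels listed above.
PREFERRED = "gluconeogenesis"
SYNONYMS = {
    "glucose synthesis": PREFERRED,
    "glucose biosynthesis": PREFERRED,
    "glucose formation": PREFERRED,
    "glucose anabolism": PREFERRED,
    "gluconeogenesis": PREFERRED,
}


def normalise(label: str) -> str:
    """Map a free-text label to its preferred term (case and spacing ignored)."""
    key = " ".join(label.lower().split())
    # Labels without a known synonym are returned in cleaned-up form.
    return SYNONYMS.get(key, key)


if __name__ == "__main__":
    for label in ["Glucose synthesis", "glucose  ANABOLISM", "Glycolysis"]:
        print(label, "->", normalise(label))
```

Two data sets labelled "Glucose formation" and "Glucose anabolism" become directly comparable once both are normalised.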
Ontologies
• An ontology is a formal specification of terms and relationships between them –widely used in biology and bioinformatics (e.g. taxonomy)
• The relationships are important and represented as graphs
• Ontology terms should have definitions
• Ontologies are machine-readable
• They are needed for ordering and comparing large data sets
Gene Ontology (GO)
• http://www.geneontology.org
• Many annotation systems are organism-specific or use different levels of granularity
• GO introduced a standard vocabulary, first used for mouse, fly and yeast, but now generic
• Three ontologies: molecular function, biological process and cellular component
GO Ontologies
• Molecular function: tasks performed by a gene product –e.g. G-protein coupled receptor
• Biological process: broad biological goals accomplished by one or more gene products –e.g. G-protein signaling pathway
• Cellular component: part(s) of a cell of which a gene product is a component; includes the extracellular environment of cells –e.g. nucleus, membrane etc.
GO hierarchy
Relationships: “is-a”, “part of”
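A small sketch of why the hierarchy matters: a term can have several parents, so the ontology is a graph rather than a simple tree, and a gene product annotated to a term is implicitly linked to every ancestor of that term. The toy graph below uses made-up edges for illustration only; it is not the real GO structure.

```python
# Toy graph: child term -> list of (relationship, parent term).
# Names and edges are illustrative, not taken from GO itself.
PARENTS = {
    "nuclear membrane": [("part_of", "nucleus"), ("is_a", "membrane")],
    "nucleus": [("is_a", "organelle"), ("part_of", "cell")],
    "membrane": [("part_of", "cell")],
    "organelle": [("part_of", "cell")],
    "cell": [],
}


def ancestors(term, graph=PARENTS):
    """Return every term reachable by following is_a / part_of edges upward."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _relationship, parent in graph.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


if __name__ == "__main__":
    print(sorted(ancestors("nuclear membrane")))
```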
How do gene products get GO terms?
• Electronic annotation:
– Through mappings to other biological entities and then automatic inference to proteins
• Manual annotation:
– Model organism databases
– Gene Ontology Annotation (GOA) project
• Evidence codes –attached to all GO annotations to show the source
Evidence Codes
IEA Inferred from Electronic Annotation
IDA Inferred from Direct Assay
IMP Inferred from Mutant Phenotype
IPI Inferred from Physical Interaction
IEP Inferred from Expression Pattern
IGI Inferred from Genetic Interaction
ISS* Inferred from Sequence or Structural Similarity
IGC Inferred from Genomic Context
RCA Inferred from Reviewed Computational Analysis
TAS Traceable Author Statement
NAS Non-traceable Author Statement
IC Inferred by Curator
ND No biological Data available
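As a rough illustration, the codes above can be grouped so that annotations are filtered by how they were produced. The three category labels and the `category` helper are assumptions for this sketch; only the codes themselves come from the table.

```python
# Codes backed by a direct experiment.
EXPERIMENTAL = {"IDA", "IMP", "IPI", "IEP", "IGI"}
# Remaining codes from the table that are assigned by a curator.
OTHER_CURATED = {"ISS", "IGC", "RCA", "TAS", "NAS", "IC", "ND"}


def category(code: str) -> str:
    """Classify an evidence code as electronic, experimental or other curated."""
    code = code.strip().upper()
    if code == "IEA":
        return "electronic"
    if code in EXPERIMENTAL:
        return "experimental"
    if code in OTHER_CURATED:
        return "other curator-assigned"
    raise ValueError(f"unknown evidence code: {code!r}")


if __name__ == "__main__":
    for code in ["IEA", "IDA", "TAS"]:
        print(code, "->", category(code))
```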
Electronic annotation: GO mappings
Fatty acid biosynthesis (SwissProt keyword) → GO:fatty acid biosynthesis (GO:0006633)
EC:6.4.1.2 (EC number) → GO:acetyl-CoA carboxylase activity (GO:0003989)
IPR000438: Acetyl-CoA carboxylase carboxyl transferase beta subunit (InterPro entry) → GO:acetyl-CoA carboxylase activity (GO:0003989)
MF_00527: Putative 3-methyladenine DNA glycosylase (HAMAP) → GO:DNA repair (GO:0006281)
The resulting GO terms are assigned to the UniProt entry
Camon et al. BMC Bioinformatics. 2005; 6 Suppl 1:S17
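A minimal sketch of electronic annotation by mapping, using the four examples on this slide (the pairing of each external entry with its GO term is read off the term names). The key prefixes such as `SP_KW:` and the `electronic_annotations` helper are naming assumptions for illustration, not an existing file format or API.

```python
# External annotation -> GO identifiers, taken from the slide examples.
# Key prefixes are an ad hoc convention for this sketch.
EXTERNAL_TO_GO = {
    "SP_KW:Fatty acid biosynthesis": ["GO:0006633"],  # fatty acid biosynthesis
    "EC:6.4.1.2": ["GO:0003989"],                     # acetyl-CoA carboxylase activity
    "InterPro:IPR000438": ["GO:0003989"],             # acetyl-CoA carboxylase activity
    "HAMAP:MF_00527": ["GO:0006281"],                 # DNA repair
}


def electronic_annotations(cross_refs):
    """Return the sorted, de-duplicated GO IDs implied by an entry's cross-references."""
    go_ids = set()
    for ref in cross_refs:
        go_ids.update(EXTERNAL_TO_GO.get(ref, []))
    return sorted(go_ids)


if __name__ == "__main__":
    entry_refs = ["EC:6.4.1.2", "InterPro:IPR000438", "SP_KW:Fatty acid biosynthesis"]
    print(electronic_annotations(entry_refs))
```

Two different sources pointing to the same GO term yield a single annotation, which is why the result is de-duplicated.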
Automatic transfer of annotations to orthologs
Ensembl GO term projection via gene homology
http://www.ensembl.org/info/data/compara
[Figure: COMPARA homologies among Anopheles, Drosophila, chicken, cow, dog, rat and mouse]
Homologies between different species are calculated.
GO terms are projected from MANUAL annotation only (IDA, IEP, IGI, IMP, IPI).
Only one-to-one and apparent one-to-one orthologies are used.
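The two projection rules can be sketched as a simple filter: keep only annotations with one of the listed evidence codes, and transfer them only across one-to-one or apparent one-to-one orthologies. The orthology labels, gene names and the `project` function are hypothetical; this is not the Ensembl pipeline itself.

```python
from dataclasses import dataclass

# Evidence codes eligible for projection, as listed on the slide.
PROJECTABLE_EVIDENCE = {"IDA", "IEP", "IGI", "IMP", "IPI"}
# Orthology type labels are assumed strings for this sketch.
PROJECTABLE_ORTHOLOGY = {"one2one", "apparent_one2one"}


@dataclass(frozen=True)
class Annotation:
    gene: str
    go_id: str
    evidence: str


def project(annotations, orthologs):
    """Project GO terms onto orthologs.

    orthologs maps a source gene to a (target gene, orthology type) pair.
    Returns a list of (target gene, GO ID) tuples.
    """
    projected = []
    for ann in annotations:
        if ann.evidence not in PROJECTABLE_EVIDENCE:
            continue  # e.g. IEA annotations are never projected
        hit = orthologs.get(ann.gene)
        if hit is None:
            continue  # no ortholog known in the target species
        target, kind = hit
        if kind in PROJECTABLE_ORTHOLOGY:
            projected.append((target, ann.go_id))
    return projected


if __name__ == "__main__":
    anns = [Annotation("geneA", "GO:0006281", "IDA"),
            Annotation("geneA", "GO:0006633", "IEA")]
    print(project(anns, {"geneA": ("geneA_mouse", "one2one")}))
```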
Manual annotation: GOA Project
• Largest open-source contributor of annotations to GO
• Member of the GO Consortium since 2001
• Provides annotation for more than 130,000 species
• GOA’s priority is to annotate the human proteome
• GOA is responsible for human, chicken, bovine and many other annotations for the GO Consortium
• Annotation is done through reading of the literature
Reference Genomes
Arabidopsis thaliana
Caenorhabditis elegans
Danio rerio (zebrafish)
Dictyostelium discoideum
Drosophila melanogaster
Escherichia coli
Homo sapiens
Saccharomyces cerevisiae
Mus musculus
Schizosaccharomyces pombe
Gallus gallus
Rattus norvegicus
• Comprehensive annotation of a set of disease-related proteins in human
• Generate a reliable set of GO annotations for the 12 selected genomes
• Empowers comparative methods used in first pass annotation of other proteomes.
Accessing GO data (1)
AmiGO browser
http://amigo.geneontology.org/cgi-bin/amigo/go.cgi
Accessing GO data (2)
QuickGO browser
http://www.ebi.ac.uk/quickgo
Example: Human Insulin Receptor (P06213)
Accessing GO data (3)
Gene Association Files
http://www.geneontology.org/GO.current.annotations.shtm
[Figure: Gene Association File example]
Downloading GOA data
ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/
http://www.ebi.ac.uk/GOA/downloads.html
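A gene association file is tab-delimited text with one annotation per line and comment lines starting with "!". The sketch below reads such lines into dictionaries. The column names follow the GAF layout as commonly documented (15 columns in GAF 1.0, 17 in GAF 2.x) and should be checked against the current specification; the sample record uses a placeholder accession, symbol and reference, not real data.

```python
# Column names for GAF 2.x; GAF 1.0 files lack the last two columns.
GAF_COLUMNS = [
    "db", "db_object_id", "db_object_symbol", "qualifier", "go_id",
    "db_reference", "evidence_code", "with_from", "aspect",
    "db_object_name", "db_object_synonym", "db_object_type",
    "taxon", "date", "assigned_by", "annotation_extension",
    "gene_product_form_id",
]


def parse_gaf(lines):
    """Yield one dict per annotation line, skipping comments and blank lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("!"):
            continue
        fields = line.split("\t")
        # Pad short (GAF 1.0) records so every dict has the same keys.
        fields += [""] * (len(GAF_COLUMNS) - len(fields))
        yield dict(zip(GAF_COLUMNS, fields))


# Placeholder record for illustration only (made-up accession and reference).
SAMPLE = [
    "!gaf-version: 2.0",
    "\t".join([
        "UniProtKB", "P00000", "EXMPL", "", "GO:0003989", "PMID:0000000",
        "IDA", "", "F", "Example protein", "", "protein", "taxon:9606",
        "20080101", "UniProt",
    ]),
]

if __name__ == "__main__":
    for record in parse_gaf(SAMPLE):
        print(record["db_object_id"], record["go_id"], record["evidence_code"])
```

The same function accepts an open file handle, since it only iterates over lines.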
Uses of GO 1
Functional annotation of proteins
Uses of GO 2
Find functional information on interacting proteins (IntAct)
Uses of GO 3
Analysis of high-throughput data
• Microarray data analysis
• Proteomics data analysis
• GO classification
Larkin JE et al, Physiol Genomics, 2004
Cunliffe HE et al, Cancer Res, 2003
Other Ontologies: Open Biomedical Ontologies
http://obo.sourceforge.net
• Central location for accessing well-structured controlled vocabularies and ontologies for use in the biological and medical sciences.
• Provides simple format for ontologies that can encode terms, relationships between terms and definitions of terms including those taken from external ontologies.
Scope of Open Biomedical Ontologies
• Anatomy
• Animal natural history and life history
• Chemical
• Development
• Ethology
• Evidence codes
• Experimental conditions
• Genomic and proteomic
• Metabolomics
• OBO relationship types
• Phenotype
• Taxonomic classification
Ontology Lookup Service (OLS)
• Single point of query for currently 47 ontologies.
• Ontologies are updated daily from CVS repositories, including the OBO CVS repository and the PRIDE CVS repository.
• A tool that offers interactive and programmatic interfaces for queries on term names, synonyms, relationships, annotations and database cross-references.
• Originally developed for using ontologies in PRIDE.
The issue faced
• Relationships between ontology terms have consequences when querying a database annotated using the ontology.
• What happens when I ask for PRIDE experiments describing the proteome of brain tissue?
Using Ontologies in PRIDE
For an experiment you want to define:
– Species: Newt / NCBI Taxonomy ID
– Tissue / organ / cell type: BRENDA Tissue ontology, Cell Type ontology;
– Sub-cellular component: Gene Ontology: GO;
– Disease: Human Disease: DOID;
– Genotype: GO;
– Sample Processing: PSI Ontology;
– Mass Spectrometry: PSI-MS Ontology;
– Protein Modifications: PSI-MOD Ontology
OLS usage examples
• http://www.ebi.ac.uk/ontology-lookup/
• What is the accession for “mitochondrion” in GO? In MeSH?
– search by term name in a specific ontology or across all
• I’m looking for a term to annotate my protocol step but I’m not sure what term to use.
– browse an ontology
• I’m looking for all the experiments done on liver tissue.
– get all child terms of liver and query on those as well
• My data set was annotated with GO version 123, but that was a long time ago.
– get updated term names for the identifiers you have and see if any have been made obsolete
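The liver example can be sketched as query expansion: collect the query term plus all of its child terms, then match experiments annotated with any of them. The tissue terms, experiment identifiers and helper functions below are invented for illustration; they are not BRENDA Tissue ontology content, PRIDE data or the OLS API.

```python
# Hypothetical parent -> children edges for a tissue ontology.
CHILDREN = {
    "liver": ["liver lobe", "hepatocyte"],
    "liver lobe": ["left liver lobe"],
}

# Hypothetical experiments, each annotated with a single tissue term.
EXPERIMENTS = {
    "exp1": "liver",
    "exp2": "hepatocyte",
    "exp3": "brain",
    "exp4": "left liver lobe",
}


def descendants(term, children=CHILDREN):
    """Return all terms below the given term, at any depth."""
    found = set()
    stack = [term]
    while stack:
        for child in children.get(stack.pop(), []):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


def find_experiments(term):
    """Return experiments annotated with the term or any of its descendants."""
    wanted = {term} | descendants(term)
    return sorted(exp for exp, tissue in EXPERIMENTS.items() if tissue in wanted)


if __name__ == "__main__":
    print(find_experiments("liver"))
```

A query on the exact string "liver" alone would return only exp1 and miss the experiments annotated with more specific terms.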
Standards for data exchange
• Systems Biology Markup Language (SBML) –computer-readable format for representing models of networks
• Biological Pathways Exchange (BioPAX) –format for representing pathways
• Proteomics Standards Initiative (PSI, MIAPE)
• Microarray standards –MIAME and MAGE
MIAPE/MIAME principles
• Enough information to:
– Remove ambiguity in experiment
– Allow easy interpretation of results
– Allow experiment to be repeated
– Enable comparison across similar experiments
• Use controlled vocabularies
Using ontologies and standards
• So much data in different places –need to organize and share it
• Used for data retrieval and comparison –easier to query
• Used for data integration and exchange –standard representation
• Used for evaluation –need “gold standard”