BI class 2010
Gene Ontology: Overview and Perspective
What is Ontology?
• Dictionary:
A branch of metaphysics concerned with the nature and relations of being
16061700s
What is the Gene Ontology?
• Allows biologists to make queries across large numbers of genes without researching each one individually
So what does that mean?
From a practical view, an ontology is the representation of something we know about. “Ontologies” consist of a representation of things that are detectable or directly observable, and the relationships between those things.
Car?
Ontology - definition
Gene Ontology (GO) Consortium
www.geneontology.org
• Formed to develop a shared language adequate for the annotation of molecular characteristics across organisms; a common language to share knowledge.
• Seeks to achieve a mutual understanding of the definition and meaning of any word used; thus we are able to support cross-database queries.
• Members agree to contribute gene product annotations and associated sequences to GO database.
How does GO work?
• What does the gene product do?
• Where and when does it act?
• Why does it perform these activities?
What information might we want to capture about a gene product?
What is the Gene Ontology?
• Set of standard biological phrases (terms) which are applied to genes/proteins:
– protein kinase
– apoptosis
– membrane
Molecular Function = elemental activity/task
– the tasks performed by individual gene products; examples are carbohydrate binding and ATPase activity
Biological Process = biological goal or objective
– broad biological goals, such as mitosis or purine metabolism, that are accomplished by ordered assemblies of molecular functions
Cellular Component = location or complex
– subcellular structures, locations, and macromolecular complexes; examples include nucleus, telomere, and RNA polymerase II holoenzyme
GO represents three biological domains
The GO is Actually Three Ontologies
Biological Process
GO term: tricarboxylic acid cycle
Synonym: Krebs cycle
Synonym: citric acid cycle
GO id: GO:0006099
Cellular Component
GO term: mitochondrion
GO id: GO:0005739
Definition: A semiautonomous, self-replicating organelle that occurs in varying numbers, shapes, and sizes in the cytoplasm of virtually all eukaryotic cells. It is notably the site of tissue respiration.
Molecular Function
GO term: Malate dehydrogenase
GO id: GO:0030060
(S)-malate + NAD(+) = oxaloacetate + NADH.
[Figure: structural formulas for the reaction above; NAD+ → NADH + H+]
Cellular Component
• where a gene product acts
• Enzyme complexes in the component ontology refer to places, not activities.
Molecular Function
• activities or “jobs” of a gene product
glucose-6-phosphate isomerase activity
Molecular Function
insulin binding
insulin receptor activity
Molecular Function
• A gene product may have several functions
• Sets of functions make up a biological process.
Biological Process
a commonly recognized series of events
cell division
Biological Process
transcription
Biological Process
regulation of gluconeogenesis
Biological Process
limb development
Ontology Structure
• Terms are linked by two relationships
– is-a
– part-of
Ontology Structure
[Diagram: cell; membrane, chloroplast; mitochondrial membrane, chloroplast membrane; edges labeled is-a and part-of]
Ontology Structure
• Ontologies are structured as a hierarchical directed acyclic graph (DAG)
• Terms can have more than one parent and zero, one or more children
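The DAG idea can be sketched in a few lines of Python. This is only an illustration: the node names come from the cell/membrane/chloroplast diagram in these slides, but the exact edge set, `EDGES`, `build_parents`, and `ancestors` are assumptions made for the example, not part of any GO tool.

```python
from collections import defaultdict

# (child, relationship, parent) triples. The edge set is an assumption
# based on the node names in the slide diagram.
EDGES = [
    ("membrane", "part_of", "cell"),
    ("chloroplast", "part_of", "cell"),
    ("mitochondrial membrane", "is_a", "membrane"),
    ("chloroplast membrane", "is_a", "membrane"),
    ("chloroplast membrane", "part_of", "chloroplast"),
]


def build_parents(edges):
    """Map each child term to its set of (relationship, parent) pairs."""
    parents = defaultdict(set)
    for child, rel, parent in edges:
        parents[child].add((rel, parent))
    return parents


def ancestors(term, parents):
    """Collect every term reachable by following parent links upward."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _rel, parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


PARENTS = build_parents(EDGES)
# "chloroplast membrane" has two direct parents, which a tree could not express.
print(sorted(parent for _rel, parent in PARENTS["chloroplast membrane"]))
print(sorted(ancestors("chloroplast membrane", PARENTS)))
```

Storing parents per child (rather than children per parent) keeps the "multiple parentage" case trivial: a term simply has more than one entry.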
Ontology Structure
[Diagram: cell; membrane, chloroplast; mitochondrial membrane, chloroplast membrane]
Directed Acyclic Graph (DAG) - multiple parentage allowed
A biological ontology is:
• A (machine) interpretable representation of some aspect of biological reality
– what kinds of things exist?
– what are the relationships between these things?
[Diagram: lens part_of eye; eye is_a sense organ; a develops_from link to optic placode]
GO Definitions: Each GO term has 2 Definitions
A definition written by a biologist: necessary & sufficient conditions; written definition (not computable)
Graph structure: necessary conditions; formal (computable)
Appropriate Relationships to Parents
• GO currently has 2 relationship types
– Is_a
• An is_a child of a parent means that the child is a complete type of its parent, but can be discriminated in some way from other children of the parent.
– Part_of
• A part_of child of a parent means that the child is always a constituent of the parent that, in combination with other constituents of the parent, makes up the parent.
Placement in the Graph: Selecting Parents
• To make the most precise definitions, new terms should be placed as children of the parent that is closest in meaning to the term.
• To make the most complete definitions, terms should have all of the parents that are appropriate.
• In an ontology as complicated as the GO this is not as easy as it seems.
True Path Violations Create Incorrect Definitions
“... the pathway from a child term all the way up to its top-level parent(s) must always be true”.
chromosome
Part_of relationship
nucleus
True Path Violations
“... the pathway from a child term all the way up to its top-level parent(s) must always be true”.
chromosome
Mitochondrial chromosome
Is_a relationship
True Path Violations
“... the pathway from a child term all the way up to its top-level parent(s) must always be true”.
chromosome
Mitochondrial chromosome
Is_a relationship
Part_of relationship
nucleus
A mitochondrial chromosome is not part of a nucleus!
True Path Violations
“... the pathway from a child term all the way up to its top-level parent(s) must always be true”.
[Diagram: nuclear chromosome is_a chromosome and part_of nucleus; mitochondrial chromosome is_a chromosome and part_of mitochondrion]
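The chromosome example can be checked mechanically. The sketch below is a hypothetical illustration (the dictionaries `VIOLATING` and `CORRECTED` and the helper `ancestors` are names invented here): it walks every path upward from a term and shows that, in the first graph, a mitochondrial chromosome is wrongly implied to be in the nucleus.

```python
def ancestors(term, parents):
    """Return every term implied by following parent links upward from term."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Graph with the true path violation: chromosome is part_of nucleus, so
# every kind of chromosome inherits "nucleus".
VIOLATING = {
    "mitochondrial chromosome": ["chromosome"],  # is_a
    "chromosome": ["nucleus"],                   # part_of
}

# Corrected graph: the part_of links sit on the specific child terms.
CORRECTED = {
    "nuclear chromosome": ["chromosome", "nucleus"],
    "mitochondrial chromosome": ["chromosome", "mitochondrion"],
}

for name, graph in (("violating", VIOLATING), ("corrected", CORRECTED)):
    implied = ancestors("mitochondrial chromosome", graph)
    print(name, "implies nucleus:", "nucleus" in implied)
```

Because a gene product annotated to a term is implicitly annotated to all of that term's ancestors, a single wrong edge high in the graph makes every annotation below it wrong.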
The Development Node (some examples of consistent definitions)
Cell level
[i] y cell differentiation
---[p] y cell fate commitment
------[p] y cell fate specification
------[p] y cell fate determination
---[p] y cell development
------[p] y cellular morphogenesis during differentiation
------[p] y cell maturation
y cell differentiation
The process whereby a relatively unspecialized cell acquires specialized features of a y cell.
http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=cmed6.figgrp.41173
y cell fate commitment
The process whereby the developmental fate of a cell becomes restricted such that it will develop into a y cell.
y cell fate specification
The process whereby a cell becomes capable of differentiating autonomously into a y cell in an environment that is neutral with respect to the developmental pathway. Upon specification, the cell fate can be reversed.
y cell fate determination
The process whereby a cell becomes capable of differentiating autonomously into a y cell regardless of its environment; upon determination, the cell fate cannot be reversed.
Gene Ontology widely adopted
AgBase
Terms are defined graphically relative to other terms
The Gene Ontology (GO)
1. Build and maintain logically rigorous and biologically accurate ontologies
2. Comprehensively annotate reference genomes
3. Support genome annotation projects for all organisms
4. Freely provide ontologies, annotations and tools to the research community
• The GO is still developing daily both in ontological structures and in domain knowledge
• Ontology development workshops focus on specific domains needing experts
• 2 workshops / year
1. Metabolism and cell cycle
2. Immunology and defense response
3. Early CNS development
4. Peripheral nervous system development
5. Blood Pressure Regulation
6. Muscle Development
Building the ontologies
Mappings files
Fatty acid biosynthesis (Swiss-Prot keyword) → GO: fatty acid biosynthesis (GO:0006633)
EC:6.4.1.2 (EC number) → GO: acetyl-CoA carboxylase activity (GO:0003989)
IPR000438: Acetyl-CoA carboxylase carboxyl transferase beta subunit (InterPro entry) → GO: acetyl-CoA carboxylase activity (GO:0003989)
725 new terms related to immunology
127 new terms added to cell type ontology
Building the ontology: Immune System Process
Red part_of
Blue is_a
Annotating Gene Products using GO
P05147 GO:0047519 IDA PMID:2976880
Gene Product: P05147
GO Term: GO:0047519
Evidence: IDA
Reference: PMID:2976880
Gene protein inherits GO term
• There is evidence that this gene product can be best classified using this term
• The source of the evidence and other information is included
• There is agreement on the meaning of the term
Annotations for APP: amyloid beta (A4) precursor protein
Annotations are assertions
No direct experiment: inferred from evidence
Direct experiment in organism
We use evidence codes to describe the basis of the annotation
• IDA: Inferred from direct assay
• IPI: Inferred from physical interaction
• IMP: Inferred from mutant phenotype
• IGI: Inferred from genetic interaction
• IEP: Inferred from expression pattern
• IEA: Inferred from electronic annotation
• ISS: Inferred from sequence or structural similarity
• TAS: Traceable author statement
• NAS: Non-traceable author statement
• IC: Inferred by curator
• RCA: Reviewed computational analysis
• ND: No data available
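As a rough illustration of how an annotation and its evidence code fit together, the sketch below parses the four-field record shown on the earlier slide (P05147 GO:0047519 IDA PMID:2976880). The `Annotation` type, `parse_annotation`, and `is_experimental` are hypothetical names; real GO annotation files carry more columns than this simplified form, and treating IDA, IPI, IMP, IGI, and IEP as the "direct experiment" group is an assumption made for the example.

```python
from typing import NamedTuple

# Evidence codes as listed on the slide.
EVIDENCE_CODES = {
    "IDA": "Inferred from direct assay",
    "IPI": "Inferred from physical interaction",
    "IMP": "Inferred from mutant phenotype",
    "IGI": "Inferred from genetic interaction",
    "IEP": "Inferred from expression pattern",
    "IEA": "Inferred from electronic annotation",
    "ISS": "Inferred from sequence or structural similarity",
    "TAS": "Traceable author statement",
    "NAS": "Non-traceable author statement",
    "IC": "Inferred by curator",
    "RCA": "Reviewed computational analysis",
    "ND": "No data available",
}

# Assumption for this sketch: these codes are the experiment-based ones.
EXPERIMENTAL = {"IDA", "IPI", "IMP", "IGI", "IEP"}


class Annotation(NamedTuple):
    gene_product: str
    go_id: str
    evidence: str
    reference: str


def parse_annotation(line):
    """Parse a whitespace-separated 'product GO-id evidence reference' record."""
    gene_product, go_id, evidence, reference = line.split()
    if evidence not in EVIDENCE_CODES:
        raise ValueError(f"unknown evidence code: {evidence}")
    return Annotation(gene_product, go_id, evidence, reference)


def is_experimental(code):
    """True if the code is in the experiment-based group assumed above."""
    return code in EXPERIMENTAL


record = parse_annotation("P05147 GO:0047519 IDA PMID:2976880")
print(record)
print(EVIDENCE_CODES[record.evidence], "| experimental:", is_experimental(record.evidence))
```

Filtering on the evidence code is how a user can, for instance, exclude purely electronic (IEA) annotations from an analysis.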
GO structure
• GO isn’t just a flat list of biological terms
• terms are related within a hierarchy
GO structure
gene A
GO structure
• This means genes can be grouped according to user-defined levels
• Allows broad overview of gene set or genome
GO Annotation Stats (2007)
GO Annotations
Total manual GO annotations: 388,633
Total proteins with manual annotations: 80,402
Contributing groups (including MGI): 19
Total PubMed references: 346,002
Total number of predicted annotations: 17,029,553
Total number of taxa: 129,318
Total number of distinct proteins: 2,971,374
gene -> GO term
associated genes
GO annotations
GO database
genome and protein databases
Now we can query across all annotations based on shared biological activity.
Annotations of gene products to GO are genome specific
GO browser
Search on ‘mesoderm development’
mesoderm development
Definition of mesoderm development
Gene products involved in mesoderm development
Traditional analysis
Gene 1: Apoptosis, Cell-cell signaling, Protein phosphorylation, Mitosis, …
Gene 2: Growth control, Mitosis, Oncogenesis, Protein phosphorylation, …
Gene 3: Growth control, Mitosis, Oncogenesis, Protein phosphorylation, …
Gene 4: Nervous system, Pregnancy, Oncogenesis, Mitosis, …
Gene 100: Positive ctrl. of cell prolif., Mitosis, Oncogenesis, Glucose transport, …
Using GO annotations
• But by using GO annotations, this work has already been done
GO:0006915 : apoptosis
Grouping by process
Apoptosis: Gene 1, Gene 53
Mitosis: Gene 2, Gene 5, Gene 45, Gene 7, Gene 35, …
Positive ctrl. of cell prolif.: Gene 7, Gene 3, Gene 12, …
Growth: Gene 5, Gene 2, Gene 6, …
Glucose transport: Gene 7, Gene 3, Gene 6, …
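Grouping by process is essentially an inversion of the gene-to-term table. A minimal sketch, using the four fully listed genes from the "Traditional analysis" slide as input (the function name `group_by_term` is hypothetical, and the slide's elided "…" entries are simply left out):

```python
from collections import defaultdict

# Per-gene annotations as listed on the "Traditional analysis" slide.
GENE_TERMS = {
    "Gene 1": ["Apoptosis", "Cell-cell signaling", "Protein phosphorylation", "Mitosis"],
    "Gene 2": ["Growth control", "Mitosis", "Oncogenesis", "Protein phosphorylation"],
    "Gene 3": ["Growth control", "Mitosis", "Oncogenesis", "Protein phosphorylation"],
    "Gene 4": ["Nervous system", "Pregnancy", "Oncogenesis", "Mitosis"],
}


def group_by_term(gene_terms):
    """Invert a gene -> terms mapping into a term -> sorted list of genes."""
    groups = defaultdict(list)
    for gene, terms in gene_terms.items():
        for term in terms:
            groups[term].append(gene)
    return {term: sorted(genes) for term, genes in groups.items()}


for term, genes in sorted(group_by_term(GENE_TERMS).items()):
    print(f"{term}: {', '.join(genes)}")
```

With standard GO terms in place of free-text labels, the same inversion works across databases because every group uses the same identifier for the same process.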
Anatomy of a GO term
id: GO:0006094
name: gluconeogenesis
namespace: process
def: The formation of glucose from noncarbohydrate precursors, such as pyruvate, amino acids and glycerol. [http://cancerweb.ncl.ac.uk/omd/index.html]
exact_synonym: glucose biosynthesis
xref_analog: MetaCyc:GLUCONEO-PWY
is_a: GO:0006006
is_a: GO:0006092
Field labels: unique GO ID (id), term name (name), ontology (namespace), definition (def), synonym (exact_synonym), database ref (xref_analog), parentage (is_a)
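Because each line of the term record is a simple "tag: value" pair, it is easy to read programmatically. The following is a minimal sketch, not an official parser: `parse_stanza` is a hypothetical helper, and it keeps every tag as a list so that repeated tags such as is_a are preserved.

```python
from collections import defaultdict

# The gluconeogenesis term exactly as shown on the slide.
STANZA = """\
id: GO:0006094
name: gluconeogenesis
namespace: process
def: The formation of glucose from noncarbohydrate precursors, such as pyruvate, amino acids and glycerol. [http://cancerweb.ncl.ac.uk/omd/index.html]
exact_synonym: glucose biosynthesis
xref_analog: MetaCyc:GLUCONEO-PWY
is_a: GO:0006006
is_a: GO:0006092
"""


def parse_stanza(text):
    """Parse 'tag: value' lines into a dict mapping each tag to a list of values."""
    fields = defaultdict(list)
    for line in text.splitlines():
        # Split on the first colon only, since values may contain colons too.
        tag, sep, value = line.partition(":")
        if not sep or not tag.strip():
            continue  # skip blank or malformed lines
        fields[tag.strip()].append(value.strip())
    return dict(fields)


term = parse_stanza(STANZA)
print(term["id"][0], term["name"][0])
print("parents:", term["is_a"])
```

Splitting on the first colon only matters here: GO IDs, cross-references, and the URL in the definition all contain colons of their own.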
GO is a functional annotation system of great utility to the data-driven biologist
GO enables genomic data analysis
• Microarrays allow biologists to record changes in gene function across entire genomes
• Result: Vast amounts of gene expression data desperately needing cataloging and tagging
• Many data analysis tools use GO graph structure to statistically evaluate clusters of co-expressed genes based on shared functional annotations
OCT 13, 2006
Cancer Genome Projects
GO supports functional classifications
GO is wildly successful
FIGURE 3. Representative cell-type-specific genes and corresponding molecular functions.
Nature: January 2007
Comprehensively annotate Reference Genomes
• Saccharomyces cerevisiae
• Schizosaccharomyces pombe
• Arabidopsis thaliana
• Human
• Mouse
• Fly
• Rat
• Chicken
• Zebrafish
• Worm
• Dicty
• E. coli
Species coverage
• All major eukaryotic model organism species
• Human via GOA group at UniProt
• Several bacterial and parasite species through TIGR and GeneDB at Sanger
– many more in pipeline
Annotation coverage
GO tools
• GO resources are freely available to anyone to use without restriction
– Includes the ontologies, gene associations and tools developed by GO
• Other groups have used GO to create tools for many purposes:
http://www.geneontology.org/GO.tools
GO tools
• Affymetrix also provide a Gene Ontology Mining Tool as part of their NetAffx™ Analysis Center which returns GO terms for probe sets
GO tools
• Many tools exist that use GO to find common biological functions from a list of genes:
http://www.geneontology.org/GO.tools.microarray.shtml
GO tools
• Most of these tools work in a similar way:
– input a gene list and a subset of ‘interesting’ genes
– tool shows which GO categories have most interesting genes associated with them, i.e. which categories are ‘enriched’ for interesting genes
– tool provides a statistical measure to determine whether enrichment is significant
GO for microarray analysis
• Annotations give ‘function’ label to genes
• Ask meaningful questions of microarray data, e.g.
– genes involved in the same process, same/different expression patterns?
Using GO in practice
• statistical measure
– how likely it is that your differentially regulated genes fall into that category by chance
microarray experiment: 1000 genes
100 genes differentially regulated
mitosis – 80/100
apoptosis – 40/100
p. ctrl. cell prol. – 30/100
glucose transp. – 20/100
[Bar chart: number of differentially regulated genes (axis 0 to 80) for mitosis, apoptosis, positive control of cell proliferation, and glucose transport]
Using GO in practice
• However, when you look at the distribution of all genes on the microarray:
Process               Genes on array   # genes expected in 100 random genes   # occurred
mitosis               800/1000         80                                     80
apoptosis             400/1000         40                                     40
p. ctrl. cell prol.   100/1000         10                                     30
glucose transp.       50/1000          5                                      20
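The "expected" column and a significance measure can be reproduced with a short calculation. The slides do not say which statistic a given tool uses; the sketch below assumes a one-sided hypergeometric test, which is one common choice, and the function names `expected_hits` and `enrichment_p_value` are hypothetical.

```python
from math import comb

ARRAY_SIZE = 1000   # genes on the array
SAMPLE_SIZE = 100   # differentially regulated genes

# process: (genes on array annotated to the process, genes observed in the sample)
TABLE = {
    "mitosis": (800, 80),
    "apoptosis": (400, 40),
    "p. ctrl. cell prol.": (100, 30),
    "glucose transp.": (50, 20),
}


def expected_hits(total, in_category, sample):
    """Number of category genes expected in a random sample of the given size."""
    return sample * in_category / total


def enrichment_p_value(total, in_category, sample, hits):
    """P(X >= hits) when drawing `sample` genes at random without replacement."""
    upper = min(sample, in_category)
    favourable = sum(
        comb(in_category, i) * comb(total - in_category, sample - i)
        for i in range(hits, upper + 1)
    )
    return favourable / comb(total, sample)


for process, (on_array, observed) in TABLE.items():
    expected = expected_hits(ARRAY_SIZE, on_array, SAMPLE_SIZE)
    p_value = enrichment_p_value(ARRAY_SIZE, on_array, SAMPLE_SIZE, observed)
    print(f"{process:20s} expected={expected:5.1f} observed={observed:3d} p={p_value:.3g}")
```

Under this assumption, mitosis and apoptosis occur exactly as often as chance predicts, while positive control of cell proliferation and glucose transport are the categories that stand out, which matches the point of the table. In practice a correction for testing many GO categories at once would also be needed.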
Enrichment tools
• GO is developing its own enrichment tool as part of the GO browser AmiGO
• Currently in testing phase, should be released next month