Post on 18-Dec-2015
transcript
Real-life ontology development:
lessons from the Gene Ontology
• What is GO?
• Evolution of GO
• Mechanisms of updating GO
• Tools for ontology development
• Lessons learned
Gene Ontology
• Built for a very specific purpose: “annotation of genes and proteins in genomic and protein databases”
• Applicable to all species
Gene Ontology - scope
• Three disjoint axes:
  – molecular function
    • molecular role, e.g. catalytic activity, binding
  – biological process
    • broad biological phenomena, e.g. mitosis, growth, digestion
  – cellular component
    • sub-cellular location, e.g. nucleus, ribosome, origin recognition complex
Gene Ontology
• Directed acyclic graph (DAG)
• Terms connected by two transitive relations (edges):
  – is_a
  – part_of
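To make the graph structure concrete, here is a minimal Python sketch of a DAG whose edges carry one of the two relations, with a helper that collects a term's ancestors. The term names come from the scope slide, but the edges themselves are illustrative assumptions and are not copied from the real GO file; `build_parents` and `ancestors` are hypothetical helper names.

```python
from collections import defaultdict

# Illustrative edges only (child, relation, parent); not taken from the real GO file.
EDGES = [
    ("nucleus", "is_a", "cellular component"),
    ("ribosome", "is_a", "cellular component"),
    ("origin recognition complex", "is_a", "cellular component"),
    ("origin recognition complex", "part_of", "nucleus"),
]


def build_parents(edges):
    """Map each child term to its list of (relation, parent) pairs."""
    parents = defaultdict(list)
    for child, relation, parent in edges:
        parents[child].append((relation, parent))
    return parents


def ancestors(term, parents, relation=None):
    """Return every term reachable from `term` by following edges upward.

    If `relation` is given, only edges of that type are followed.
    """
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for rel, parent in parents.get(current, []):
            if relation is not None and rel != relation:
                continue
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


PARENTS = build_parents(EDGES)
print(sorted(ancestors("origin recognition complex", PARENTS)))
print(sorted(ancestors("origin recognition complex", PARENTS, relation="part_of")))
```

Because a term may have more than one parent, the structure is a graph rather than a tree; the `seen` set keeps each ancestor from being visited twice.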
Gene Ontology
• Developed by an international consortium
  – about 50 members
• Editorial office, 4 full-time editors (ish)
• Many other part-time editors at databases
• Multiple changes made a day
  – made live immediately
Gene Ontology
• Main ontology format: OBO flat file
• Changes are live immediately
  – no releases
• Propagated to GO database
  – monthly snapshots archived
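For readers who have not seen an OBO flat file, the sketch below parses a small hand-written sample into Python dictionaries. The `EX:` identifiers and the stanzas are hypothetical, and `parse_obo` is an illustrative helper that handles only plain `tag: value` lines and `!` comments; it is not a full implementation of the format.

```python
# Hand-written sample in OBO flat-file style; identifiers are hypothetical.
SAMPLE = """\
format-version: 1.2

[Term]
id: EX:0000001
name: cellular component

[Term]
id: EX:0000002
name: nucleus
is_a: EX:0000001 ! cellular component

[Term]
id: EX:0000003
name: origin recognition complex
is_a: EX:0000001 ! cellular component
relationship: part_of EX:0000002 ! nucleus
"""


def parse_obo(text):
    """Parse [Term] stanzas into a dict keyed by term id.

    Simplification: everything after a "!" is treated as a comment, and
    escapes, trailing modifiers and non-Term stanzas are ignored.
    """
    terms = {}
    current = None
    for raw in text.splitlines():
        line = raw.split("!", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        if line.startswith("["):
            # Start a new record only for [Term] stanzas.
            current = {} if line == "[Term]" else None
            continue
        if current is None or ":" not in line:
            continue  # header lines, or stanzas this sketch does not handle
        tag, value = line.split(":", 1)
        value = value.strip()
        if tag == "id":
            terms[value] = current
        current.setdefault(tag, []).append(value)
    return terms


TERMS = parse_obo(SAMPLE)
for term_id, record in TERMS.items():
    print(term_id, record["name"][0])
```

Each tag maps to a list because tags such as `is_a` can appear more than once in a stanza.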
Evolution of GO
• Original GO created in 2000
• Three databases involved:
  – FlyBase (Drosophila)
  – MGI (Mouse)
  – SGD (S. cerevisiae)
• Used immediately
Evolution of GO
• Later databases:
  – TAIR (Arabidopsis)
  – TIGR (microbes including prokaryotes)
  – SWISS-PROT (several thousand species inc. human)
  – PSU (P. falciparum)
• Recent additions
  – ZFIN (zebrafish)
  – PAMGO (plant pathogens)
Evolution of GO
• GO development traditionally annotation-driven
  – development directed by use
• Terms added as new species annotated
• Terms added on an as-needed basis
Evolution of GO
• Resulted in ‘organic’ structure, little formality
• Ontological formality added subsequently
  – philosophical and logical
Growth of GO
[Figure: GO term history 2001 - 2007. x-axis: date (Jan-01 to Jan-07); y-axis: number of terms (0 to 30,000); series: obsolete, undefined terms, defined terms.]
Modifying the graph:
• But then I need to annotate VW Beetles, pre-1980
• The graph no longer works, because the engine is in the boot
Mechanisms for ontology change
• Small incremental changes
• Initially all changes to the ontologies made this way
Mechanisms for ontology change
• Suggested changes initially submitted by email
• Moved to an online tracking system when this became unmanageable
Requesting changes to GO - curator requests tracker
• Web-based tracking system hosted at SourceForge.net
• Public
• Tracker item for each new request or question
Mechanisms for ontology change
• Problems:
  – Larger questions about the higher ontology structure remain unresolved
  – Makes some items impossible to close
  – No sense of the ‘big picture’
  – Large areas of the ontologies missing or incomplete because no annotations
  – Massive volume
    • needed to increase the number of editors
Mechanisms for ontology change
• Larger-scale changes:
  – content meetings
  – interest groups
Content meetings
• Short meetings aimed at developing specific areas of GO ontology content
  – proposals refined and discussed before meeting
  – small number of people (10-15)
  – invited experts
  – specific topics
Content meetings
• Further refinements made following meeting by email
• Changes are made once consensus reached
• Large number of terms typically added (500+)
Content meetings
• Recent meetings:
  – immunology
  – interactions between organisms
  – CNS development
Content meetings
• Advantages
  – Allows a lot of detailed work to be done on a very specific area
  – Involves external expertise
Content meetings
• Problems:
  – Expensive - everyone has to be in the same location
  – Only works for very specific topics
  – Long lag time getting terms into ontologies
Interest groups
• Groups of experts for a specific topic
  – e.g. development, cell cycle, plants
• Includes GO curators/annotators and external experts
• Don’t typically meet face to face
Interest groups
• Communicate via email, desktop sharing etc.
• Transporters area of the ontology recently revised this way
Interest groups
• Advantages
  – Cheap, no travel required
  – Allows a lot of detailed work to be done on a very specific area
  – Involves external expertise
Interest groups
• Disadvantages
  – Harder to reach consensus when not face to face
  – Projects tend to drag on
Mechanisms for ontology change
• Systematic changes via small working groups
Systematic changes
• Projects not directly related to biological content
• Systematic changes throughout ontology
• Small group of GO consortium members
  – meets regularly by desktop sharing, voice over IP
• Experts recruited to meetings as needed
Systematic changes
• Changes either
  – made on a branch of the ontology and merged in later
    • always have big problems merging branched file into main file
  – merged directly into live ontology after session
    • fast, but people get angry
is_a complete
• GO contains both is_a and part_of relations
• Typically, graphs were a mixture of incomplete is_a and part_of hierarchies
• A result of ‘organic’ evolution of GO
• All graphs now have complete is_a paths to root
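The "is_a complete" property can be checked mechanically. The following Python sketch walks down from the root over is_a links and reports any term that was never reached. The toy data and the function name `is_a_incomplete` are assumptions made for illustration; "digestion" is given no is_a parent here to stand in for a term that has only a part_of parent.

```python
from collections import defaultdict, deque

# Toy is_a parents (child -> list of is_a parents); illustrative only.
IS_A = {
    "cellular process": ["biological process"],
    "mitosis": ["cellular process"],
    "growth": ["biological process"],
    "digestion": [],  # assumed to have only a part_of parent, so no is_a path
}


def is_a_incomplete(is_a_parents, root):
    """Return the terms that cannot reach `root` by following is_a links only."""
    children = defaultdict(set)
    terms = {root}
    for child, parents in is_a_parents.items():
        terms.add(child)
        for parent in parents:
            terms.add(parent)
            children[parent].add(child)

    # Breadth-first walk downward from the root over is_a children.
    reached = {root}
    queue = deque([root])
    while queue:
        for child in children[queue.popleft()]:
            if child not in reached:
                reached.add(child)
                queue.append(child)
    return terms - reached


print(is_a_incomplete(IS_A, "biological process"))
```

An empty result would mean every term has at least one complete is_a path to the root.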
partial disjointness
• Biological process terms organised by granularity:
  – cellular process
  – multicellular organism process
  – multi-organism process
• To avoid massive increase in number of paths to root, these terms are disjoint
  – no is_a children in common
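The disjointness rule lends itself to an automated check. Below is a small Python sketch that reports any pair of the three granularity terms sharing a direct is_a child. It reads "children" as direct children only, which is an assumption; the toy data, the deliberately violating "hypothetical term", and the function name `shared_is_a_children` are all illustrative.

```python
from itertools import combinations

DISJOINT = [
    "cellular process",
    "multicellular organism process",
    "multi-organism process",
]

# Toy is_a parents (child -> list of is_a parents); illustrative only.
IS_A = {
    "mitosis": ["cellular process"],
    "digestion": ["multicellular organism process"],
    # Deliberate violation so the check has something to report.
    "hypothetical term": ["cellular process", "multi-organism process"],
}


def shared_is_a_children(is_a_parents, disjoint_terms):
    """Return {(term_a, term_b): shared_children} for every violating pair."""
    children = {term: set() for term in disjoint_terms}
    for child, parents in is_a_parents.items():
        for parent in parents:
            if parent in children:
                children[parent].add(child)

    violations = {}
    for a, b in combinations(sorted(disjoint_terms), 2):
        shared = children[a] & children[b]
        if shared:
            violations[(a, b)] = shared
    return violations


print(shared_is_a_children(IS_A, DISJOINT))
```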
sensu
• sensu (meaning ‘in the sense of’) used to disambiguate, by taxonomic group, terms with identical strings but different meanings
• e.g. sporulation (sensu Viridiplantae) vs. sporulation (sensu Bacteria)
sensu
• Current project to remove the sensu term strings
• Replace with strings that represent the true differentiae
• e.g.
  – cell wall (sensu Bacteria) -> peptidoglycan-based cell wall
  – cell wall (sensu Fungi) -> chitin- and beta-glucan-containing cell wall
Systematic changes to GO
• Advantages
  – Fast
  – Efficient
  – Small number of people required

Systematic changes to GO
• Disadvantages
  – Difficult to obtain wider consensus
  – Changes sometimes have to be undone
Useful tools for ontology development
• WebEx
  – desktop sharing, can control each other’s desktops
• wiki
  – mainly internal
• Skype
  – free international calls!
• conference calls
  – not free
Tracking changes to GO
• General tracking
  – files stored in CVS, all differences trackable (in theory)
  – far from ideal - a frequent discussion is whether we should track history and date-stamp terms
Tracking changes to GO
• Obsolete terms
  – formerly stored within the ontology
  – in OBO format made a special kind of deprecated term (tag is_obsolete)
  – Soon to create ‘replaced_by’ and ‘consider’ tags to point to live terms
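As a sketch of how the planned tags could be used by a consumer of the ontology, the Python below follows `replaced_by` links from an obsolete term until it reaches a live one, and falls back to the `consider` suggestions when there is no single replacement. The term records, the `EX:` identifiers, and the `resolve` function are hypothetical; the exact semantics of the tags are assumed from the slide's description, not taken from a specification.

```python
# Hypothetical term records keyed by id; only the tags used here are shown.
TERMS = {
    "EX:0000010": {"name": "old term", "is_obsolete": True,
                   "replaced_by": ["EX:0000011"]},
    "EX:0000011": {"name": "newer term", "is_obsolete": True,
                   "replaced_by": ["EX:0000012"]},
    "EX:0000012": {"name": "live term"},
    "EX:0000013": {"name": "ambiguous old term", "is_obsolete": True,
                   "consider": ["EX:0000012", "EX:0000014"]},
    "EX:0000014": {"name": "another live term"},
}


def resolve(term_id, terms):
    """Follow replaced_by links until a live term is reached.

    Returns (live_id, []) when a chain of single replacements ends at a live
    term, or (None, suggestions) when there is no single replacement and only
    `consider` hints (possibly none) are available.
    """
    seen = set()
    current = term_id
    while terms[current].get("is_obsolete", False):
        if current in seen:
            raise ValueError(f"replaced_by cycle at {current}")
        seen.add(current)
        replacements = terms[current].get("replaced_by", [])
        if len(replacements) != 1:
            return None, list(terms[current].get("consider", []))
        current = replacements[0]
    return current, []


print(resolve("EX:0000010", TERMS))
print(resolve("EX:0000013", TERMS))
```

The distinction the sketch relies on is that a single `replaced_by` target can be substituted automatically, while `consider` targets need a curator to choose.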
Tracking changes to GO
• Crediting experts
  – traditionally no mechanism for doing this
  – creating abstracts for content meetings, adding tag to term
  – as yet no mechanism for crediting individuals
Useful tools for ontology development
• OBO-Edit
  – ontology editor originally developed for GO
  – can be used for any OBO format ontology
  – developed by group of users
Useful tools for ontology development
• Reasoner integrated into OBO-Edit
  – based on OBOL
  – detects missing links, redundant links
  – soon: misplaced terms, automatic term creation
• Validation system
  – typographical errors, is_a orphans, duplicate synonyms etc.
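To illustrate what "redundant links" means, here is a small Python sketch that flags a direct is_a link when the same parent is already implied through another direct parent. It is not the OBO-Edit reasoner and makes no claim about how that tool works; the function name `redundant_is_a_links` and the toy data are assumptions.

```python
# Toy is_a parents (child -> list of is_a parents); illustrative only.
IS_A = {
    "mitosis": ["cellular process", "biological process"],
    "cellular process": ["biological process"],
}


def redundant_is_a_links(is_a_parents):
    """Return (child, parent) is_a links that are implied by a longer is_a path."""

    def ancestors(term):
        # All terms reachable upward from `term`, excluding `term` itself in a DAG.
        seen, stack = set(), list(is_a_parents.get(term, []))
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(is_a_parents.get(current, []))
        return seen

    redundant = set()
    for child, parents in is_a_parents.items():
        for parent in parents:
            # Redundant if some other direct parent already leads to `parent`.
            if any(parent in ancestors(other)
                   for other in parents if other != parent):
                redundant.add((child, parent))
    return redundant


print(redundant_is_a_links(IS_A))
```

Here the direct link from "mitosis" to "biological process" is reported, because it already follows from the path through "cellular process" and the transitivity of is_a.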
Lessons learned
• An ontology doesn’t have to be perfect or complete to be used
• For domain ontologies, external experts should be involved
• Communication is critical
• You will never please everyone