How to Build an Ontology
Barry Smith
http://ontology.buffalo.edu/smith
1
Everywhere databases are being created
too often in such a way that the data is siloed
leading to massive expense in integrating data in ad hoc ways
if the data could be collected on the basis of shared controlled vocabularies from the start, much of this massive expense could be avoided
2
Uses of ‘ontology’ in PubMed abstracts
3
By far the most successful: GO (Gene Ontology)
4
Consequences of the Human Genome Project
we can match gene sequences very effectively, for example finding patterns shared between humans and mice
but we can make sense of these gene sequences only if we know
• where in the cell they occur
• with what molecular functions they are associated
• to what biological processes they contribute
5
GO provides a controlled system of terms for use in annotating (describing, tagging) data
• multi-species, multi-disciplinary, open source
• contributing to the cumulativity of scientific results obtained by distinct research communities
• compare use of kilograms, meters, seconds in formulating experimental results
6
Hierarchical view representing relations between represented types
7
[Figure: fragment of an anatomy hierarchy. Nodes: Anatomical Structure, Anatomical Space, Organ, Organ Part, Organ Subdivision, Organ Component, Organ Cavity, Organ Cavity Subdivision, Tissue, Serous Sac, Serous Sac Cavity, Serous Sac Cavity Subdivision, Pleural Sac, Pleural Cavity, Pleura (Wall of Sac), Parietal Pleura, Visceral Pleura, Mediastinal Pleura, Interlobar recess, Mesothelium of Pleura. Edges: is_a, part_of]
8
US $100 million invested in literature and data curation using GO
over 11 million annotations relating gene products described in the UniProt, Ensembl and other databases to terms in the GO
experimental results reported in 52,000 scientific journal articles manually annotated by expert biologists using GO
9
GO has learned the lessons of successful cooperation
• Clear documentation
• The terms chosen are already familiar
• Fully open source (allows thorough testing in manifold combinations with other ontologies)
• Subjected to constant third-party critique
• Updated every night
10
ontologies used to annotate databases
[Figure: databases MouseEcotope, GlyProt, DiabetInGene, GluChem; GO term: sphingolipid transporter activity]
11
annotation using common ontologies yields integration of databases
[Figure: databases MouseEcotope, GlyProt, DiabetInGene, GluChem; GO term: Holliday junction helicase complex]
12
annotation using common ontologies can yield integration of image data
13
annotation using common ontologies can support comparison of image data
14
annotation with Gene Ontology
supports reusability of data
supports search of data by humans
supports reasoning with data by humans and machines
− but the method works only to the degree that many, many people use the GO to annotate their data
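A minimal sketch of why annotation with a common ontology integrates data, assuming four databases whose records are tagged with term identifiers from one shared ontology. The database names come from the slides; the record ids and the identifiers TERM:0001 and TERM:0002 are invented for illustration and are not real GO identifiers.

```python
from collections import defaultdict

# Hypothetical annotations: (database, record id, ontology term id).
# The term ids are placeholders, not real GO identifiers.
ANNOTATIONS = [
    ("MouseEcotope", "rec-1", "TERM:0001"),
    ("GlyProt", "rec-7", "TERM:0001"),
    ("DiabetInGene", "rec-3", "TERM:0002"),
    ("GluChem", "rec-9", "TERM:0001"),
]


def index_by_term(annotations):
    """Group records from all databases under the shared term they carry."""
    index = defaultdict(list)
    for database, record_id, term_id in annotations:
        index[term_id].append((database, record_id))
    return dict(index)


index = index_by_term(ANNOTATIONS)
# One lookup by term reaches records held in separate databases.
print(index["TERM:0001"])
```

The lookup only works because every database uses the same identifier for the same type, which is the point of the caveat above: the benefit grows with the number of people annotating against the same ontology.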
15
GO has been amazingly successful in overcoming the data balkanization problem
but it covers only generic biological entities of three sorts:
– cellular components
– molecular functions
– biological processes
and it does not provide representations of diseases, symptoms, …
16
Original OBO Foundry ontologies (Gene Ontology in yellow)
Columns: RELATION TO TIME (CONTINUANT, split into INDEPENDENT and DEPENDENT; OCCURRENT)
Rows: GRANULARITY
ORGAN AND ORGANISM
    Independent continuant: Organism (NCBI Taxonomy); Anatomical Entity (FMA, CARO)
    Dependent continuant: Organ Function (FMP, CPRO); Phenotypic Quality (PaTO)
    Occurrent: Biological Process (GO)
CELL AND CELLULAR COMPONENT
    Independent continuant: Cell (CL); Cellular Component (FMA, GO)
    Dependent continuant: Cellular Function (GO)
MOLECULE
    Independent continuant: Molecule (ChEBI, SO, RnaO, PrO)
    Dependent continuant: Molecular Function (GO)
    Occurrent: Molecular Process (GO)
17
[Figure: the OBO Foundry grid of slide 17 (GRANULARITY rows against RELATION TO TIME columns), with the Environment Ontology added]
Environment Ontology: environments are here
18
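Columns: RELATION TO TIME (CONTINUANT, split into INDEPENDENT and DEPENDENT; OCCURRENT)
Rows: GRANULARITY
COMPLEX OF ORGANISMS
    Independent continuant: Family, Community, Deme, Population
    Dependent continuant: Population Phenotype
    Occurrent: Population Process
ORGAN AND ORGANISM
    Independent continuant: Organism (NCBI Taxonomy); Anatomical Entity (FMA, CARO)
    Dependent continuant: Organ Function (FMP, CPRO); Phenotypic Quality (PaTO)
    Occurrent: Biological Process (GO)
CELL AND CELLULAR COMPONENT
    Independent continuant: Cell (CL); Cellular Component (FMA, GO)
    Dependent continuant: Cellular Function (GO)
MOLECULE
    Independent continuant: Molecule (ChEBI, SO, RnaO, PrO)
    Dependent continuant: Molecular Function (GO)
    Occurrent: Molecular Process (GO)
order
19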
Ontology success stories, and some reasons for failure
chaos
20
[Figure: the OBO Foundry grid repeated from slide 19, with rows COMPLEX OF ORGANISMS, ORGAN AND ORGANISM, CELL AND CELLULAR COMPONENT and MOLECULE against the CONTINUANT and OCCURRENT columns]
http://obofoundry.org
21
Developers commit to working to ensure that, for each domain, there is community convergence on a single ontology
and agree in advance to collaborate with developers of ontologies in adjacent domains.
http://obofoundry.org
The OBO Foundry: a step-by-step, evidence-based approach to expand the GO
22
OBO Foundry Principles
Common governance (coordinating editors)
Common training
Common architecture
• simple shared top level ontology
• shared Relation Ontology: www.obofoundry.org/ro
23
[Figure, repeated from earlier in the deck: anatomy hierarchy around Pleural Sac and Pleural Cavity, with is_a and part_of relations]
24
Open Biomedical Ontologies Foundry
Seeks to create high-quality, validated terminology modules across all of the life sciences, which will
• provide one ontology for each domain, so no need for mappings
• be close to the language use of experts
• be evidence-based
• incorporate a strategy for motivating potential developers and users
• be revisable as science advances
25
Benefits of coordination
• Can profit from lessons learned through mistakes made by others
• Can more easily reuse what is made by others
• Can more easily inspect and criticize results of others’ work
• Can more easily train people to do the necessary work
BFO Top-Level Ontology
Continuant
    Independent Continuant
    Dependent Continuant
Occurrent (always dependent on one or more independent continuants)
27
OBO Foundry coverage
Columns: RELATION TO TIME (CONTINUANT, split into INDEPENDENT and DEPENDENT; OCCURRENT)
Rows: GRANULARITY
ORGAN AND ORGANISM
    Independent continuant: Organism (NCBI Taxonomy); Anatomical Entity (FMA, CARO)
    Dependent continuant: Organ Function (FMP, CPRO); Phenotypic Quality (PaTO)
    Occurrent: Organism-Level Process (GO)
CELL AND CELLULAR COMPONENT
    Independent continuant: Cell (CL); Cellular Component (FMA, GO)
    Dependent continuant: Cellular Function (GO)
    Occurrent: Cellular Process (GO)
MOLECULE
    Independent continuant: Molecule (ChEBI, SO, RnaO, PrO)
    Dependent continuant: Molecular Function (GO)
    Occurrent: Molecular Process (GO)
28
List of BFO users
http://www.ifomis.org/bfo/users
29
BFO Users
ACGT Master Ontology (ACGT MO): represent the domain of cancer research and management in a computationally tractable manner
AFO Foundational Ontology
Biomedical Ethics Ontology
Biomedical Grid Terminology (BiomedGT): open, collaboratively developed terminology for translational research
BioTop: A Biomedical Top-Domain Ontology
BIRNLex: controlled terminology for annotation of BIRN data sources
Cell Cycle Ontology: application ontology for the representation and integrated analysis of the cell cycle process
Cell Ontology: designed as a structured controlled vocabulary for cell types
Chemical Entities of Biological Interest (ChEBI): freely available dictionary of molecular entities focused on ‘small’ chemical compounds
Cognitive Paradigm Ontology
Common Anatomy Reference Ontology (CARO): anatomical structures in all organisms
Drug Interaction Ontology (DIO): ontology-driven inferences of possible drug-drug interactions
Dynamic Earth Sciences Ontologies: Process and Event Ontologies
Environment Ontology: an ontology that supports the annotation of the environment of any organism or biological sample
Evolution Ontology (EO)
FlyBase: enhancing Drosophila Gene Ontology annotations
Foundational Model of Anatomy (FMA): structure of the mammalian and in particular the human body
Gene Ontology (GO): attributes of gene products in all organisms
Infectious Disease Ontology at the Duke University Medical Center
Information Artifact Ontology (IAO)
Interdisciplinary Prostate Ontology Project (IPOP)
Lipid Ontology
medicognos: medical knowledge and workflow management framework with integrated DSS for quality, safety and disease management applications
MIRO and IRbase: IT Tools for the Epidemiological Monitoring of Insecticide Resistance in Mosquito Disease Vectors
Nanoparticle Ontology (NPO): Ontology for Cancer Nanotechnology Research
Neuroscience Information Framework Standard (NIFSTD) Ontology: a collection of OWL modules covering distinct domains of biomedical reality
Neural Electromagnetic Ontologies (NEMO): Ontology-based Tools for Representation and Integration of Event-related Brain Potentials
Ontology of Clinical Research (OCRe)
Ontology-Based eXtensible Data Model (OBX)
Ontology of Data Mining Investigations (OntoDM)
Ontology for Biomedical Investigations (OBI): design, protocol, instrumentation, and analysis applied in biomedical investigations
Ontology for General Medical Science (OGMS)
Ontology of Biomedical Reality for the pathology domain of spine (scoliosis domain) (OBR-Scolio)
Petrochemical Ontology
Phenotypic Quality Ontology (PaTO): qualities of biomedical entities
Proteomics data and process provenance ontology (ProPreO): bioinformatics for glycan expression, integrated technology resource for biomedical glycomics
Protein Ontology (PRO): protein types and modifications classified on the basis of evolutionary relationships
RNA Ontology (RnaO): RNA features, interactions and motifs
Senselab Ontology with applications to NeuronDB and BrainPharm
Sequence Ontology (SO): features and properties of nucleic sequences
Sleep Domain Ontology
Subcellular Anatomy Ontology (SAO) of NCMIR
Translational Medicine Ontology
Vaccine Ontology (VO)
yOWL: ontology-driven knowledge base for yeast biologists
Zebrafish Anatomical Ontology (ZAO): anatomical structures in D. rerio
30
How to build an ontology
• import BFO into Protégé
• work with domain experts to create an initial mid-level classification
• find ~50 most commonly used terms corresponding to types in reality
• arrange these terms into an informal is_a hierarchy according to the principle
    • A is_a B =def. every instance of A is an instance of B
• fill in missing terms to give a complete hierarchy
• (leave it to domain experts to populate the lower levels of the hierarchy)
31
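The is_a principle in the steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the type names are borrowed from the cat example later in the deck, the instance names Felix and Kermit are invented, and a plain dictionary stands in for what would normally be built in an editor such as Protégé.

```python
# Asserted is_a links: child type -> parent type (one parent each).
IS_A = {
    "siamese": "cat",
    "cat": "mammal",
    "mammal": "animal",
    "frog": "animal",
    "animal": "organism",
    "organism": "object",
}

# Hypothetical instances, each recorded with its most specific type.
INSTANCE_OF = {"Felix": "siamese", "Kermit": "frog"}


def ancestors(term):
    """Return every type reachable from `term` by following is_a upwards."""
    result = []
    while term in IS_A:
        term = IS_A[term]
        result.append(term)
    return result


def is_a(child, parent):
    """True if `child` is the same type as `parent` or one of its subtypes."""
    return child == parent or parent in ancestors(child)


def instances_of(term):
    """All instances of `term`, including instances of its subtypes."""
    return sorted(name for name, t in INSTANCE_OF.items() if is_a(t, term))


print(instances_of("cat"))     # instances of cat ...
print(instances_of("mammal"))  # ... are also instances of mammal
```

Because instances are looked up through the hierarchy, every instance of a subtype is automatically an instance of each ancestor, which is exactly what the principle requires.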
Example: The Cell Ontology
Basic distinction among entities
type vs. instance
(science text vs. diary)
(human being vs. Tom Cruise)
(science diagram vs. photograph)
33
Terms in ontologies denote types (‘universals’)
it is generalizations that are important = types, kinds, species
34
A 515287 DC3300 Dust Collector Fan
B 521683 Gilmer Belt
C 521682 Motor Drive Belt
Catalog vs. inventory
35
types vs. instances
36
names of instances
37
names of types
38
An ontology is a representation of types
We learn about types in reality from looking at the results of scientific experiments in the form of scientific theories
experiments relate to what is particular
science describes what is general
39
[Figure: types (object, organism, animal, mammal, cat, siamese, frog) and their instances]
40
Ontologies are here
41
or here
42
Ontologies represent general structures in reality (leg)
43
Ontologies do not represent concepts in people’s heads
44
They represent types in reality
45
Inventory vs. Catalog: Two kinds of representational artifact
Databases represent instances
Ontologies represent types
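The catalog and inventory distinction maps naturally onto two data structures. The sketch below assumes the three catalog entries shown earlier in the deck; the serial numbers and stock levels are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartType:
    """A catalog entry: a type, of which many instances may exist."""
    part_number: str
    name: str


@dataclass(frozen=True)
class PartInstance:
    """An inventory entry: one particular item of a given type."""
    serial: str  # hypothetical serial number
    part_type: PartType


# The catalog plays the role of the ontology: it lists types.
CATALOG = {
    "515287": PartType("515287", "DC3300 Dust Collector Fan"),
    "521683": PartType("521683", "Gilmer Belt"),
    "521682": PartType("521682", "Motor Drive Belt"),
}

# The inventory plays the role of the database: it lists instances.
INVENTORY = [
    PartInstance("SN-001", CATALOG["521683"]),
    PartInstance("SN-002", CATALOG["521683"]),
    PartInstance("SN-003", CATALOG["515287"]),
]


def count_in_stock(part_number):
    """Number of inventory items that instantiate the given catalog type."""
    return sum(1 for item in INVENTORY if item.part_type.part_number == part_number)
```

A type can sit in the catalog with no item currently in stock, and many items can share one type; the two structures answer different questions.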
46
How do we know which general terms designate types?
Types are repeatables:
cell, electron, weapon, F16, citizen, refugee, ...
Instances are one-off: Bill Clinton, this laptop
47
BFO Top-Level Ontology
Continuant
    Independent Continuant
    Dependent Continuant
Occurrent (always dependent on one or more independent continuants)
48
Two kinds of entities
occurrents (processes, events, happenings)
continuants (objects, qualities, states...)
49
You are a continuant
Your life is an occurrent
You are 3-dimensional
Your life is 4-dimensional
50
BFO Top-Level Ontology
Continuant
    Independent Continuant
    Dependent Continuant
Occurrent (always dependent on one or more independent continuants)
51
Dependent entities
require independent continuants as their bearers
There is no run without a runner
There is no grin without a cat
52
Dependent vs. independent continuants
Independent continuants (organisms, buildings, environments)
Dependent continuants (quality, shape, role, propensity, function, status, power, right)
53
All occurrents are dependent entities
They are dependent on those independent continuants which are their participants (agents, patients, media ...)
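One way to make the dependence constraints concrete is to refuse to construct a dependent entity without its bearer or participants. The classes below are a hypothetical sketch, not BFO's formal definitions; the class and field names are chosen here for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class IndependentContinuant:
    """Something that exists in its own right, e.g. an organism."""
    name: str


@dataclass
class DependentContinuant:
    """A quality, role, function and so on. It must have a bearer."""
    name: str
    bearer: IndependentContinuant

    def __post_init__(self):
        if not isinstance(self.bearer, IndependentContinuant):
            raise ValueError(f"{self.name!r} requires an independent continuant as bearer")


@dataclass
class Occurrent:
    """A process or event. It must have at least one participant."""
    name: str
    participants: List[IndependentContinuant]

    def __post_init__(self):
        if not self.participants:
            raise ValueError(f"{self.name!r} requires at least one participant")
        for participant in self.participants:
            if not isinstance(participant, IndependentContinuant):
                raise ValueError(f"{self.name!r} has a participant that is not an independent continuant")


cat = IndependentContinuant("cat")
grin = DependentContinuant("grin", bearer=cat)        # no grin without a cat
runner = IndependentContinuant("runner")
run = Occurrent("run", participants=[runner])         # no run without a runner
```

Attempting `DependentContinuant("grin", bearer=None)` or `Occurrent("run", participants=[])` raises `ValueError`, mirroring the two slogans above.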
54
Principle of Low Hanging Fruit
Include even absolutely trivial assertions (assertions you know to be universally true)
pneumococcal bacterium is_a bacterium
Computers need to be led by the hand
55
Principle of singular nouns
Terms in ontologies represent types
Goal: Each term in an ontology should represent exactly one type
Thus every term should be a singular noun
56
MeSH
MeSH Descriptors > Index Medicus Descriptor > Anthropology, Education, Sociology and Social Phenomena (MeSH Category) > Social Sciences > Political Systems > National Socialism
National Socialism is_a Political Systems
National Socialism is_a Anthropology ...
57
Principle: distinguish use from mention
mouse =def. common name for the species mus musculus
swimming is healthy and has eight letters
58
How to avoid the use-mention confusion
Avoid confusing words with things
Avoid confusing concepts in our minds with entities in reality
Recommendation: avoid the word ‘concept’ entirely
59
Three Levels
L3. Words, models (published representations, ontologies, databases ...)
L2. Ideas (thoughts, memories, ...)
L1. Things (cells, planets, processes of cell division ...)
60
‘Heparin therapy’ is an instance of ‘written or spoken designation of a concept’
What are the problems here?
1. misuse of quotation marks
2. confusion of instances and types
3. confusion of concept and reality
Trialbank
61
Principle: Avoid mass nouns
Brenda Tissue Ontology
blood is_a hematopoietic system
hematopoietic system is_a whole body
whole_body is_a animal
62
Count vs. mass nouns
Count
suitcase
cow
datum
Mass
luggage
beef
information
63
Principle of definitions
Supply definitions for every term
1. human-understandable natural language definition
2. an equivalent formal definition
64
Principle: definitions must be unique
Each term should have exactly one definition
it may have both natural-language and formal versions
(issue with ontologies which exist with different levels of expressivity)
65
The Problem of Circularity
A Person =def. A person with an identity document
Hemolysis =def. The causes of hemolysis
66
Principle of non-circularity
The term defined should not appear in its own definition
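A crude automated check for this principle is easy to write. The function below is a heuristic sketch: it only catches the term appearing verbatim (ignoring case) in its own definition, and will miss plurals, synonyms and circles that run through several definitions.

```python
import re


def is_circular(term, definition):
    """True if the term being defined occurs as a whole word or phrase in its definition."""
    pattern = r"\b" + re.escape(term.strip().lower()) + r"\b"
    return re.search(pattern, definition.lower()) is not None


# The two circular examples from the previous slide, and one sound definition.
print(is_circular("person", "A person with an identity document"))
print(is_circular("hemolysis", "The causes of hemolysis"))
print(is_circular("human being", "An animal which is rational"))
```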
67
Principle of Aristotelian definitions
Use Aristotelian definitions
An A is a B which C’s.
A human being is an animal which is rational
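The pattern "An A is a B which C's" can be generated from structured data, which keeps each definition tied to the term's is_a parent. The sketch below assumes a hypothetical `Term` record with a genus (the parent B) and a differentia (the clause C); the article chooser is deliberately naive and only looks at the first letter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    name: str         # the term A being defined
    genus: str        # its is_a parent B
    differentia: str  # what marks out A's among B's, phrased as "which ..."


def article(noun):
    """Naive choice of indefinite article based on the first letter only."""
    return "an" if noun[0].lower() in "aeiou" else "a"


def aristotelian_definition(term):
    """Render 'An A is a B which C's' from a Term record."""
    return (
        f"{article(term.name).capitalize()} {term.name} is "
        f"{article(term.genus)} {term.genus} {term.differentia}"
    )


human = Term("human being", genus="animal", differentia="which is rational")
print(aristotelian_definition(human))
```

Since the genus field doubles as the asserted parent, a definition written this way cannot drift apart from the hierarchy.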
68
Principle of increase in understandability
A definition should use only terms which are easier to understand than the term defined
Definitions should not make simple things more difficult than they are
69
HL7
‘stopping a medication’ = def.
change of state in the record of a Substance Administration Act from Active to Aborted
70
Univocity
Terms should have the same meanings on every occasion of use.
(= They should refer to the same types)
Basic ontological relations such as is_a and part_of should be used in the same way by all ontologies
71
Universality: the all-some rule
Ontologies are made of relational assertions
They should include only those relational assertions which hold universally
Cell membrane part_of cell
72
universality
Often, order will matter:
We can assert
adult transformation_of child
but not
child transforms_into adult
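The all-some reading can be tested against instance data: an assertion "A rel B" holds universally only if every instance of A stands in rel to some instance of B. The sketch below is a simplification that treats each life stage as a separate record; the identifiers are invented.

```python
def holds_universally(instances_of_a, instances_of_b, related_pairs):
    """All-some check: every instance of A is related to at least one instance of B."""
    return all(
        any((a, b) in related_pairs for b in instances_of_b)
        for a in instances_of_a
    )


# Hypothetical data: child-2 never reaches adulthood.
CHILDREN = {"child-1", "child-2"}
ADULTS = {"adult-1"}
TRANSFORMATION_OF = {("adult-1", "child-1")}
TRANSFORMS_INTO = {(child, adult) for (adult, child) in TRANSFORMATION_OF}

# Every adult was once a child, so this assertion can be made.
adult_from_child = holds_universally(ADULTS, CHILDREN, TRANSFORMATION_OF)
# Not every child becomes an adult, so the inverse cannot be asserted.
child_into_adult = holds_universally(CHILDREN, ADULTS, TRANSFORMS_INTO)
print(adult_from_child, child_into_adult)
```

The same pairs of individuals are involved in both directions; only the direction of the universal quantifier changes, which is why order matters.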
73
universality
viral pneumonia caused by virus
but not
virus causes pneumonia
pneumococcal bacterium causes pneumonia
74
Principle of Universality
results analysis later_than protocol-design
but not
protocol-design earlier_than results analysis
75
Principle of positivity
Complements of types are not themselves types.
Terms such as
non-mammal
non-membrane
other metalworker in New Zealand
do not designate types in reality
76
Avoid conjunctive and disjunctive combinations
There are no conjunctive and disjunctive types:
anatomic structure, system, or substance
musculoskeletal and connective tissue disorder
77
Principle: Don’t confuse ontology and epistemology
Which types exist in reality is not a function of our knowledge.
Terms such as
unknown
unclassified
unlocalized
arthropathies not otherwise specified
do not designate types in reality.
78
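The last three principles (positivity, no conjunctive or disjunctive combinations, no epistemological terms) lend themselves to a simple label check. The patterns below are a heuristic sketch built from the example terms on these slides; they will produce false positives and are meant to flag labels for human review, not to decide anything automatically.

```python
import re

# Heuristic patterns derived from the example terms on the slides.
NEGATIVE_PREFIX = re.compile(r"^(non-|other\b)", re.IGNORECASE)
LOGICAL_WORD = re.compile(r"\b(and|or)\b", re.IGNORECASE)
EPISTEMIC_WORD = re.compile(
    r"\b(unknown|unclassified|unlocalized|not otherwise specified)\b",
    re.IGNORECASE,
)


def lint_term(term):
    """Return a list of suspected principle violations for one term label."""
    problems = []
    if NEGATIVE_PREFIX.search(term):
        problems.append("negative or residual term")
    if LOGICAL_WORD.search(term):
        problems.append("conjunctive or disjunctive term")
    if EPISTEMIC_WORD.search(term):
        problems.append("epistemological term")
    return problems


for label in [
    "non-mammal",
    "other metalworker in New Zealand",
    "musculoskeletal and connective tissue disorder",
    "arthropathies not otherwise specified",
    "cell membrane",
]:
    print(label, "->", lint_term(label))
```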
Principle: Don’t confuse ontology and epistemology
If you want to say that
We do not know where A’s are located
do not invent a new class of
A’s with unknown locations
(A well-constructed ontology should grow linearly; it should not need to delete classes or relations because of increases in knowledge)
79
If you want to say
I surmise that this is a case of pneumonia
do not invent a new class of surmised pneumonias
Confusion of ‘findings’ in medical terminologies
Principle: Don’t confuse ontology and epistemology
80
is_a Overloading
The success of ontology alignment demands that ontological relations (is_a, part_of, ...) have the same meanings in the different ontologies to be aligned.
81
Principle: is_a should always mean is a subtype of
John is_a human being
biological process is_a Gene Ontology (old GO)
Achham cattle breed is_a organism (SNOMED)
82
Multiple Inheritance
[Figure: ‘blue car’ has two is_a parents, ‘car’ (via is_a1) and ‘blue thing’ (via is_a2); both are children of ‘thing’]
83
How to solve this problem
Create two ontologies:
of cars
of colors
Link the two together via cross-products
(= factoring, normalization, modularization)
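The factoring step can be sketched as follows, assuming two small term sets standing in for the car and color ontologies. The relation name `has_quality` and the extra terms are assumptions made for illustration; the slides do not name the linking relation.

```python
from dataclasses import dataclass

# Two separate, hypothetical ontologies, reduced here to their term sets.
CAR_TERMS = {"car"}
COLOR_TERMS = {"blue", "red"}


@dataclass(frozen=True)
class CrossProduct:
    """A compound term defined from one term in each ontology."""
    genus: str    # term from the car ontology; the only asserted is_a parent
    quality: str  # term from the color ontology, linked by has_quality (assumed name)

    def label(self):
        return f"{self.quality} {self.genus}"

    def definition(self):
        return f"a {self.genus} which has_quality {self.quality}"


def make_cross_product(genus, quality):
    """Build a compound term, checking both parts against their own ontology."""
    if genus not in CAR_TERMS:
        raise ValueError(f"unknown car term: {genus!r}")
    if quality not in COLOR_TERMS:
        raise ValueError(f"unknown color term: {quality!r}")
    return CrossProduct(genus, quality)


blue_car = make_cross_product("car", "blue")
print(blue_car.label(), "=def.", blue_car.definition())
```

The compound term keeps a single asserted parent (car); its membership among blue things follows from the definition and need not be asserted as a second is_a link.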
84
Compositionality
The meanings of compound terms should be determined
1. by the meanings of component terms
together with
2. the rules governing syntax
85
Single Inheritance
No kind in a classificatory hierarchy should be asserted to have more than one is_a parent on the immediate higher level
86
Multiple Inheritance
[Figure: ‘blue car’ has two is_a parents, ‘car’ and ‘blue thing’; both are children of ‘thing’]
87
Multiple Inheritance
is a source of errors
encourages laziness
serves as obstacle to integration with neighboring ontologies
hampers use of Aristotelian methodology for defining terms
hampers use of statistical search tools
88
Multiple Inheritance
[Figure: ‘blue car’ has two is_a parents, ‘car’ (via is_a1) and ‘blue thing’ (via is_a2); both are children of ‘thing’]
89
Principle of asserted single inheritance
Each reference ontology module should be built as an asserted monohierarchy (a hierarchy in which each term has at most one parent)
Asserted hierarchy vs. inferred hierarchy
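Whether an asserted hierarchy is a monohierarchy can be checked mechanically. The sketch below assumes the asserted is_a links are available as (child, parent) pairs and reports every term with more than one asserted parent; the edge list reproduces the blue car example.

```python
def multiple_parents(asserted_edges):
    """Return {term: sorted parents} for every term with more than one asserted is_a parent."""
    parents = {}
    for child, parent in asserted_edges:
        parents.setdefault(child, set()).add(parent)
    return {child: sorted(ps) for child, ps in parents.items() if len(ps) > 1}


# The blue car example: two asserted parents, so not a monohierarchy.
EDGES = [
    ("car", "thing"),
    ("blue thing", "thing"),
    ("blue car", "car"),
    ("blue car", "blue thing"),
]
print(multiple_parents(EDGES))

# Dropping the second asserted link restores single inheritance; the link to
# 'blue thing' can then be left to inference rather than assertion.
print(multiple_parents(EDGES[:-1]))
```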
90
Principle of normalization
Polyhierarchies should be decomposable into homogeneous disjoint monohierarchies
91
Principle of instantiation
A term should be included in an ontology only if there is evidence that instances to which that term refers exist in reality.
92
Why do we need rules/standards for good ontology?
Ontologies must be intelligible both to humans (for annotation and curation) and to machines (for reasoning and error-checking): the lack of rules for classification leads to human error and blocks automatic reasoning and error-checking
Intuitive rules facilitate training of curators and annotators
Common rules allow alignment with other ontologies
93
Ontology path dependence principle
The decisions made by the creators of an ontology – including those decisions which pertain to the ontology’s upper-level architecture – should as far as possible be made on the basis of the degree to which they advance the consistency of that ontology with the reference ontologies already existing in relevant domains.
94
User feedback principle
An ontology should evolve on the basis of feedback derived from those who are using the ontology, for example for purposes of annotation.
95