Introduction
• How we do bioinformatics
• What is knowledge
• What is an ontology
• Classes, individuals, …
• The components of an ontology
• Examples
How We Do Bioinformatics
• No Euclid, no Newton
• No equations and no axioms
• Cannot take an amino acid sequence, submit it to an equation and get some biology
• … so we do similarity searches
Transferring Characteristics
[Figure: an uncharacterised protein compared with Tra1, La2 and La3; high similarity, so transfer characteristics]
What do we Transfer?
• When sequences are sufficiently similar, we transfer what we understand about one sequence to the other
• The “understanding” is our knowledge about that protein
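The transfer step above can be sketched in a few lines of Python. This is only an illustration: the protein names, sequences, annotations and the 80% identity threshold are all invented, and percent identity over pre-aligned sequences stands in for a real similarity search.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent of identical positions in two pre-aligned, equal-length sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be aligned to the same non-zero length")
    matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    return 100.0 * matches / len(a)


def transfer_annotation(query, characterised, threshold=80.0):
    """Return (name, score, annotations) of the best hit if it reaches the threshold, else None."""
    best_name, best_score = None, 0.0
    for name, (seq, _annotations) in characterised.items():
        score = percent_identity(query, seq)
        if score > best_score:
            best_name, best_score = name, score
    if best_name is not None and best_score >= threshold:
        return best_name, best_score, characterised[best_name][1]
    return None


# Hypothetical characterised proteins: name -> (aligned sequence, what we know about it)
known = {
    "proteinA": ("MKTAYIAKQR", ["hydrolase activity"]),
    "proteinB": ("MSTNPKPQRK", ["structural molecule activity"]),
}

hit = transfer_annotation("MKTAYIAKQK", known)
print(hit)
```

Note that what gets transferred is just a list of strings; whether two resources mean the same thing by those strings is the problem the rest of the talk addresses.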
What is Knowledge?
• Knowledge – all information and an understanding to carry out tasks and to infer new information
• Information -- data equipped with meaning
• Data -- un-interpreted signals that reach our senses
Michael Ashburner | Name | man
Professor | Job | academic, senior
University of Cambridge | Institution | ancient university, 5 rated
UK | Country | European
ISMB | Conf | important figure in biology
BIOLOGY
Uniprot:- A protein database?
ID   PRIO_HUMAN STANDARD; PRT; 253 AA.
AC   P04156;
DT   01-NOV-1986 (Rel. 03, Created)
DT   01-NOV-1986 (Rel. 03, Last sequence update)
DT   20-AUG-2001 (Rel. 40, Last annotation update)
DE   Major prion protein precursor (PrP) (PrP27-30) (PrP33-35C) (ASCR).
GN   PRNP.
OS   Homo sapiens (Human).
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC   Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
OX   NCBI_TaxID=9606;
RN   [1]
RP   SEQUENCE FROM N.A.
RX   MEDLINE=86300093; PubMed=3755672;
RA   Kretzschmar H.A., Stowring L.E., Westaway D., Stubblebine W.H.,
RA   Prusiner S.B., Dearmond S.J.;
RT   "Molecular cloning of a human prion protein cDNA.";
RL   DNA 5:315-324(1986).
RN   [2]
RP   SEQUENCE OF 8-253 FROM N.A.
RX   MEDLINE=86261778; PubMed=3014653;
RA   Liao Y.-C.J., Lebo R.V., Clawson G.A., Smuckler E.A.;
RT   "Human prion protein cDNA: molecular cloning, chromosomal mapping,
RT   and biological implications.";
RL   Science 233:364-367(1986).
RN   [3]
RP   SEQUENCE OF 58-85 AND 111-150 (VARIANT AMYLOID GSS).
RX   MEDLINE=91160504; PubMed=1672107;
RA   Tagliavini F., Prelli F., Ghiso J., Bugiani O., Serban D.,
RA   Prusiner S.B., Farlow M.R., Ghetti B., Frangione B.;
RT   "Amyloid protein of Gerstmann-Straussler-Scheinker disease (Indiana
RT   kindred) is an 11 kd fragment of prion protein with an N-terminal
RT   glycine at codon 58.";
RL   EMBO J. 10:513-519(1991).
RN   [4]
RP   STRUCTURE BY NMR OF 118-221.
RX   MEDLINE=20359708; PubMed=10900000;
RA   Calzolai L., Lysek D.A., Guntert P., von Schroetter C., Riek R.,
RA   Zahn R., Wuethrich K.;
RT   "NMR structures of three single-residue variants of the human prion
RT   protein.";
RL   Proc. Natl. Acad. Sci. U.S.A. 97:8340-8345(2000).
CC   -!- FUNCTION: THE FUNCTION OF PRP IS NOT KNOWN. PRP IS ENCODED IN THE
CC       HOST GENOME AND IS EXPRESSED BOTH IN NORMAL AND INFECTED CELLS.
CC   -!- SUBUNIT: PRP HAS A TENDENCY TO AGGREGATE YIELDING POLYMERS CALLED
CC       "RODS".
CC   -!- SUBCELLULAR LOCATION: ATTACHED TO THE MEMBRANE BY A GPI-ANCHOR.
CC   -!- POLYMORPHISM: THE FIVE TANDEM OCTAPEPTIDE REPEATS REGION IS HIGHLY
CC       UNSTABLE. INSERTIONS OR DELETIONS OF OCTAPEPTIDE REPEAT UNITS ARE
CC       ASSOCIATED TO PRION DISEASE.
FT   SIGNAL 1 22
FT   CHAIN 23 230 MAJOR PRION PROTEIN.
FT   PROPEP 231 253 REMOVED IN MATURE FORM (BY SIMILARITY).
FT   LIPID 230 230 GPI-ANCHOR (BY SIMILARITY).
FT   CARBOHYD 181 181 N-LINKED (GLCNAC...) (PROBABLE).
FT   DISULFID 179 214 BY SIMILARITY.
FT   DOMAIN 51 91 5 X 8 AA TANDEM REPEATS OF P-H-G-G-G-W-G-
FT       Q.
FT   REPEAT 51 59 1.
FT   REPEAT 60 67 2.
FT   REPEAT 68 75 3.
FT   REPEAT 76 83 4.
FT   REPEAT 84 91 5.
FT       IN PATIENTS WHO HAVE A PRP MUTATION AT
FT       CODON 178: PATIENTS WITH MET DEVELOP FFI,
FT       THOSE WITH VAL DEVELOP CJD).
FT       /FTId=VAR_006467.
FT   VARIANT 171 171 N -> S (IN SCHIZOAFFECTIVE DISORDER).
FT       /FTId=VAR_006468.
FT   VARIANT 178 178 D -> N (IN FFI AND CJD).
FT       /FTId=VAR_006469.
FT   VARIANT 180 180 V -> I (IN CJD).
FT       /FTId=VAR_006470.
FT   VARIANT 183 183 T -> A (IN FAMILIAL SPONGIFORM
FT       ENCEPHALOPATHY).
FT       /FTId=VAR_006471.
FT   VARIANT 187 187 H -> R (IN GSS).
FT       /FTId=VAR_008746.
FT   VARIANT 188 188 T -> K (IN EOAD; DEMENTIA ASSOCIATED TO
FT       PRION DISEASES).
FT       /FTId=VAR_008748.
FT   VARIANT 188 188 T -> R.
FT       /FTId=VAR_008747.
FT   VARIANT 196 196 E -> K (IN CJD).
FT       /FTId=VAR_008749.
FT       /FTId=VAR_006472.
SQ   SEQUENCE 253 AA; 27661 MW; 43DB596BAAA66484 CRC64;
     MANLGCWMLV LFVATWSDLG LCKKRPKPGG WNTGGSRYPG QGSPGGNRYP PQGGGGWGQP
     HGGGWGQPHG GGWGQPHGGG WGQPHGGGWG QGGGTHSQWN KPSKPKTNMK HMAGAAAAGA
     VVGGLGGYML GSAMSRPIIH FGSDYEDRYY RENMHRYPNQ VYYRPMDEYS NQNNFVHDCV
     NITIKQHTVT TTTKGENFTE TDVKMMERVV EQMCITQYER ESQAYYQRGS SMVLFSSPPV
     ILLISFLIFL IVG
//
CC   -!- DISEASE: PRP IS FOUND IN HIGH QUANTITY IN THE
CC       BRAIN OF HUMANS AND ANIMALS INFECTED
CC       WITH NEURODEGENERATIVE DISEASES KNOWN AS
CC       TRANSMISSIBLE SPONGIFORM ENCEPHALOPATHIES OR PRION
CC       DISEASES, LIKE: CREUTZFELDT-JAKOB DISEASE (CJD),
CC       GERSTMANN-STRAUSSLER SYNDROME (GSS), FATAL
CC       FAMILIAL INSOMNIA (FFI) AND KURU IN HUMANS;
CC       SCRAPIE IN SHEEP AND GOAT; BOVINE SPONGIFORM
CC       ENCEPHALOPATHY (BSE) IN CATTLE; TRANSMISSIBLE
CC       MINK ENCEPHALOPATHY (TME); CHRONIC WASTING
CC       DISEASE (CWD) OF MULE DEER AND ELK; FELINE
CC       SPONGIFORM ENCEPHALOPATHY (FSE) IN CATS AND
CC       EXOTIC UNGULATE ENCEPHALOPATHY (EUE) IN
CC       NYALA AND GREATER KUDU. THE PRION DISEASES
CC       ILLUSTRATE THREE MANIFESTATIONS OF CNS
CC       DEGENERATION: (1) INFECTIOUS (2)
CC       SPORADIC AND (3) DOMINANTLY INHERITED FORMS.
CC       TME, CWD, BSE, FSE, EUE ARE ALL THOUGHT TO
CC       OCCUR AFTER CONSUMPTION OF PRION-INFECTED
CC       FOODSTUFFS.
DR   EMBL; M13667; AAA19664.1; -.
DR   EMBL; M13899; AAA60182.1; -.
DR   EMBL; D00015; BAA00011.1; -.
DR   PIR; A05017; A05017.
DR   PIR; A24173; A24173.
DR   PIR; S14078; S14078.
DR   PDB; 1E1G; 20-JUL-00.
DR   PDB; 1E1J; 20-JUL-00.
DR   PDB; 1E1P; 20-JUL-00.
DR   PDB; 1E1S; 21-JUL-00.
DR   PDB; 1E1U; 20-JUL-00.
DR   PDB; 1E1W; 20-JUL-00.
DR   MIM; 176640; -.
DR   MIM; 123400; -.
DR   MIM; 137440; -.
DR   MIM; 245300; -.
DR   MIM; 600072; -.
DR   MIM; 604920; -.
DR   InterPro; IPR000817; Prion.
DR   Pfam; PF00377; prion; 1.
DR   PRINTS; PR00341; PRION.
DR   SMART; SM00157; PRP; 1.
DR   PROSITE; PS00291; PRION_1; 1.
DR   PROSITE; PS00706; PRION_2; 1.
KW   Prion; Brain; Glycoprotein; GPI-anchor; Repeat; Signal;
KW   3D-structure; Polymorphism; Disease mutation.
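A record like the one above is easy to split by its two-letter line codes, yet that only recovers structure, not meaning. The sketch below is a minimal, assumed reading of the layout (line code, spaces, content, `//` as terminator), not a full parser for the format; the function name `group_by_line_code` is made up for this example.

```python
from collections import defaultdict


def group_by_line_code(record: str) -> dict:
    """Group the lines of one flat-file entry by their two-letter line code."""
    grouped = defaultdict(list)
    for line in record.splitlines():
        if line.startswith("//"):  # end-of-entry marker
            break
        code, _, rest = line.partition(" ")
        if len(code) == 2 and code.isalpha():
            grouped[code].append(rest.strip())
    return dict(grouped)


# A short excerpt of the entry shown on the slide.
excerpt = """ID   PRIO_HUMAN STANDARD; PRT; 253 AA.
AC   P04156;
DE   Major prion protein precursor (PrP) (PrP27-30) (PrP33-35C) (ASCR).
GN   PRNP.
CC   -!- FUNCTION: THE FUNCTION OF PRP IS NOT KNOWN.
KW   Prion; Brain; Glycoprotein; GPI-anchor; Repeat; Signal;
KW   3D-structure; Polymorphism; Disease mutation.
//"""

entry = group_by_line_code(excerpt)
keywords = [
    k.strip(" .")
    for line in entry["KW"]
    for k in line.split(";")
    if k.strip(" .")
]
print(entry["AC"], keywords)
```

The keywords come out as clean tokens, but the `CC` comment lines remain free text that only a human reader can interpret.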
Words in Bioinformatics
“When I use a word,” Humpty Dumpty said in rather a scornful tone, “it means just what I choose it to mean - neither more nor less.”
“The question is,” said Alice, “whether you can make words mean so many different things.”
“The question is,” said Humpty Dumpty, “which is to be master - that’s all”
Through the Looking-Glass, Lewis Carroll
Post-Genomic Biology
• Fly, mouse, yeast, worm all have their own terminologies
• I want to compare genomes
• How?
• Sequences are comparable
• What we know about sequences is not (by human or machine)
• Need a common understanding of what sequences do
A Shared Understanding
• Synonyms and homonyms are rife
• Need to know that terms in one resource mean the same in another resource
• Means comparisons are much easier: can ask questions over many resources
• A structure of relationships enables discovery and query abstractions
• Useful for both humans and computers
• The Gene Ontology allows queries outside one model organism
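A minimal sketch of what a shared understanding buys: local terms from different resources are mapped onto shared term identifiers, so synonyms compare as equal and homonyms do not. The resource names, local terms and `T:` identifiers are all invented for illustration.

```python
# Hypothetical shared vocabulary: identifier -> preferred label
SHARED_TERMS = {
    "T:0001": "hydrolase activity",
    "T:0002": "nucleus (cell organelle)",
    "T:0003": "nucleus (cluster of neurons)",
}

# Hypothetical mapping of (resource, local term) -> shared identifier
LOCAL_TO_SHARED = {
    ("flydb", "hydrolase"): "T:0001",                     # synonym
    ("yeastdb", "hydrolytic enzyme activity"): "T:0001",  # synonym
    ("flydb", "nucleus"): "T:0002",                       # homonym, cell biology sense
    ("neurodb", "nucleus"): "T:0003",                     # homonym, neuroanatomy sense
}


def to_shared(resource: str, local_term: str):
    """Look up the shared identifier for a resource's local term (None if unmapped)."""
    return LOCAL_TO_SHARED.get((resource, local_term.lower()))


def same_meaning(a, b) -> bool:
    """True only if both (resource, term) pairs map to the same shared term."""
    ta, tb = to_shared(*a), to_shared(*b)
    return ta is not None and ta == tb


print(same_meaning(("flydb", "hydrolase"), ("yeastdb", "hydrolytic enzyme activity")))
print(same_meaning(("flydb", "nucleus"), ("neurodb", "nucleus")))
```

Different words that map to one identifier can be queried together; identical words that map to different identifiers are kept apart.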
Gene Ontology http://www.geneontology.org
“a dynamic controlled vocabulary that can be applied to all eukaryotes”
Built by the community for the community.
Three organising principles: Molecular function, Biological process, Cellular component
Describes kinds of things and parts of things
Describes ~17,000 things
London Bills of Mortality
Aggregated Stats
The art of ranking things in genera and species is of no small importance and very much assists our judgment as well as our memory. You know how much it matters in botany, not to mention animals and other substances, or again moral and notional entities as some call them. Order largely depends on it, and many good authors write in such a way that their whole account could be divided and subdivided according to a procedure related to genera and species. This helps one not merely to retain things, but also to find them. And those who have laid out all sorts of notions under certain headings or categories have done something very useful.
Gottfried Wilhelm Leibniz, New Essays on Human Understanding
Ontology
• Semantics – the meaning of meaning.
• Philosophical discipline, branch of philosophy that deals with the nature and the organisation of reality.
• Science of Being (Aristotle, Metaphysics, IV, 1)
• What is being?
• What are the features common to all beings?
So What?
• Describing what “exists” in our domain
• We have Protein, Gene, Intron, Exon, Hydrolase activity, etc.
• We can also describe how these “things” relate to each other
• We can define what they mean; define the properties of these things such that we can recognise those things
• We are capturing our understanding
• Sharing this understanding between humans and computers
• Making what we understand explicit
What Is An Ontology?
• No universally agreed-upon definition
• A “specification of a conceptualisation”
• Conceptualisation refers to the set of concepts that people use to talk about a given domain and the relationships among these concepts
• A set of vocabulary terms and definitions that capture a community’s understanding of their domain
• CS has perverted the original philosophy
• Ontology == conceptual model of a domain
What Is An Ontology?
Elements that most agree on:
– classes = sets of things
– instances = members of classes
– relationships
– axioms = additional logical statements
What Is An Ontology?
• Idea of a controlled vocabulary:
– Each element has a unique name
– Each element has a specified definition
– For a given entity or relationship in the domain, there should only be one element in the ontology representing it
– Ask for “hydrolase activity” and get all and only hydrolase activity
What Is an Ontology?
• Hierarchy (or taxonomy) is very important:
– Classes arranged into a hierarchy
– subclass = descendant class
– direct subclass = child class
– superclass = ancestor class
– direct superclass = parent class
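The four terms above can be made concrete with a small single hierarchy stored as a child-to-parent map. The class names are borrowed from molecular function terms, but the tree is simplified for illustration and is not meant to reproduce any real ontology.

```python
# Simplified, illustrative hierarchy: class -> direct superclass (parent)
PARENT = {
    "catalytic activity": "molecular function",
    "hydrolase activity": "catalytic activity",
    "phosphatase activity": "hydrolase activity",
    "peptidase activity": "hydrolase activity",
    "binding": "molecular function",
}


def superclasses(cls: str) -> list:
    """All ancestor classes, nearest first."""
    out = []
    while cls in PARENT:
        cls = PARENT[cls]
        out.append(cls)
    return out


def direct_subclasses(cls: str) -> list:
    """Child classes only."""
    return sorted(c for c, p in PARENT.items() if p == cls)


def subclasses(cls: str) -> list:
    """All descendant classes."""
    return sorted(c for c in PARENT if cls in superclasses(c))


print(superclasses("phosphatase activity"))
print(direct_subclasses("molecular function"))
print(subclasses("catalytic activity"))
```

A query for a class can then be answered with the class plus everything returned by `subclasses`, which is what makes the hierarchy useful for retrieval.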
What Is an Ontology?
• Can be a single hierarchy, in which each class can only have one direct superclass, or a multiple hierarchy (or polyhierarchy), in which each class can have more than one direct superclass
• is-a relationship between a class and its superclass(es)
• A class inherits the properties that have been defined for its superclass(es)
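A small sketch of a polyhierarchy with property inheritance. The classes echo the animal example used later in the talk, but the specific is-a links and the property values are assumptions made up for this illustration.

```python
# Illustrative polyhierarchy: class -> list of direct superclasses
SUPERCLASSES = {
    "animal": [],
    "vermin": [],
    "rodent": ["animal"],
    "mouse": ["rodent", "vermin"],  # two direct superclasses
}

# Properties stated directly on each class (invented values)
PROPERTIES = {
    "animal": {"eats": "food"},
    "vermin": {"regarded_as": "pest"},
    "rodent": {"has_teeth": "incisors"},
    "mouse": {},
}


def ancestors(cls: str) -> list:
    """All superclasses reachable via is-a, nearest first (breadth-first)."""
    seen = []
    queue = list(SUPERCLASSES[cls])
    while queue:
        parent = queue.pop(0)
        if parent not in seen:
            seen.append(parent)
            queue.extend(SUPERCLASSES[parent])
    return seen


def is_a(cls: str, other: str) -> bool:
    """True if cls is other or a descendant of other."""
    return cls == other or other in ancestors(cls)


def inherited_properties(cls: str) -> dict:
    """Own properties plus everything defined on any superclass."""
    props = {}
    for c in reversed(ancestors(cls)):  # farther classes first, so nearer ones override
        props.update(PROPERTIES[c])
    props.update(PROPERTIES[cls])
    return props


print(ancestors("mouse"))
print(inherited_properties("mouse"))
```

Although nothing is stated directly on `mouse`, it ends up with the properties of all three of its superclasses.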
Why develop an ontology?
• To make domain assumptions explicit
– Easier to change domain assumptions
– Easier to understand and update legacy data
• To separate domain knowledge from operational knowledge
– Re-use domain and operational knowledge separately
• A community reference for applications
• To share a consistent understanding of what information means.
Classes
• Classes: Sets of things in the world (nouns)
• Classes of individuals
• Classes: Person, protein, gene, DNA
• Individuals: Robert (NE 67 51 48A), a LARD protein, a TrpA gene, a bacterium O23912
• Classes represent the things we know in our domain
Properties
• Classes have properties that describe their nature
• Properties held by the individuals in a class
• Properties made by relationships to individuals in other classes
• Some properties must be held by a class
• These are necessary to be a member of a class
• Some properties are sufficient to define membership of a class
• These are sufficient to recognise an individual as being a class member
Classes
• Primitive classes:
– properties are necessary
– Globular protein must have hydrophobic core, but a protein with a hydrophobic core need not be a globular protein
• Defined classes:
– properties are necessary + sufficient
– Eukaryotic cells must have a nucleus. Every cell that contains a nucleus must be Eukaryotic.
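The difference between the two kinds of class can be sketched as follows. Properties are reduced to plain strings, and `ClassDef`, `may_be_member` and `recognised_as` are names invented for this example, so treat it as an illustration of the idea only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClassDef:
    name: str
    conditions: frozenset  # properties every member must have (necessary)
    defined: bool          # True if the conditions are also sufficient


GLOBULAR = ClassDef(
    "globular protein", frozenset({"has hydrophobic core"}), defined=False
)
EUKARYOTIC = ClassDef(
    "eukaryotic cell", frozenset({"is a cell", "contains nucleus"}), defined=True
)


def may_be_member(props: set, cls: ClassDef) -> bool:
    """Necessary conditions: lacking any of them rules membership out."""
    return cls.conditions <= props


def recognised_as(props: set, cls: ClassDef) -> bool:
    """Only a defined class lets us infer membership from properties alone."""
    return cls.defined and cls.conditions <= props


some_protein = {"has hydrophobic core"}
some_cell = {"is a cell", "contains nucleus", "has mitochondria"}

print(may_be_member(some_protein, GLOBULAR), recognised_as(some_protein, GLOBULAR))
print(recognised_as(some_cell, EUKARYOTIC))
```

For the primitive class the check can only say "not excluded"; for the defined class the same check is enough to classify the individual.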
An explicit description of a domain
• Rather than arguing about meaning of words
• We argue about characteristics of things
• Experience shows writing a list of characteristics or properties describing a “thing” saves much time
• Computationally useful – gives a computer something to work with…
[Figure: example class diagram with the labels animal, rodent, mouse, cat, dog, cow, eats, domestic, vermin]
Classification of the Classical Tyrosine Phosphatases
Incremental Addition of Protein Functional Domains
Phosphatase catalytic
Cadherin-like
Immunoglobulin
MAM domain
Cellular retinaldehyde
Adhesion recognition
Transmembrane
Fibronectin III
Glycosylation
Determining Class Definitions for Phosphatases
R2A
- Contains 2 protein tyrosine phosphatase domains
- Contains 1 transmembrane domain
- Contains 4 fibronectin domains
- Contains 1 immunoglobulin domain
- Contains 1 MAM domain
- Contains 1 cadherin-like domain
Form complete OWL descriptions and classify
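The R2A description above can be mimicked with a simple domain count. This is a deliberately naive stand-in: it requires an exact match of domain composition, whereas an OWL reasoner works with logical class descriptions rather than equality of counts. The function and variable names are invented for the example.

```python
from collections import Counter

# Domain composition of R2A, taken from the list above
R2A_DEFINITION = Counter({
    "protein tyrosine phosphatase": 2,
    "transmembrane": 1,
    "fibronectin": 4,
    "immunoglobulin": 1,
    "MAM": 1,
    "cadherin-like": 1,
})

CLASS_DEFINITIONS = {"R2A": R2A_DEFINITION}


def classify(domains: list) -> list:
    """Return the names of all classes whose domain composition matches exactly."""
    composition = Counter(domains)
    return [
        name
        for name, definition in CLASS_DEFINITIONS.items()
        if composition == definition
    ]


# A hypothetical protein instance described by its domains
protein = (
    ["protein tyrosine phosphatase"] * 2
    + ["transmembrane"]
    + ["fibronectin"] * 4
    + ["immunoglobulin", "MAM", "cadherin-like"]
)

print(classify(protein))
print(classify(protein[:-1]))  # same protein without the cadherin-like domain
```

Adding further class definitions to `CLASS_DEFINITIONS` would turn this into a crude catalogue lookup for other phosphatase classes.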
What is the Ontology Telling Us?
• Each class of phosphatase defined in terms of domain composition
• We know the characteristics by which an individual protein can be recognised to be a member of a particular class of phosphatase
• We have this knowledge in a computational form
• If we had protein instances described in terms of the ontology, we could classify those individual proteins
• A catalogue of phosphatases
Classification of Protein Tyrosine Phosphatases
So what is an ontology?
Catalog/ID
Thesauri
Terms/glossary
Informal Is-a
Formal is-a
Formal instance
Frames (properties)
General logical constraints
Value restrictions
Disjointness, inverse, part-of
Gene Ontology
Mouse Anatomy
EcoCyc
PharmGKB
TAMBIS
Arom
[Deborah McGuinness, Stanford]
Types of Description
Class of Service: Domain description SWISS-PROT service
Abstract Service: Inputs, outputs, algorithm
Service instance: As offered by EBI, NCBI
Invoked instance: What was called
myGrid Ontologies
Bioinformatics ontology
Web service ontology
Task ontology
Publishing ontology
Informatics ontology
Molecular biology ontology
Organisation ontology
Upper level ontology
Specialises. All concepts are subclassed from those in the more general ontology.
Contributes concepts to form definitions.
Using a Service Ontology
Discovery of an appropriate Web Service within a registry by its properties and capabilities;
Invocation by some agent;
Interoperability is increased by describing the semantic type of inputs and outputs;
Composition of new services;
Verification of a service’s properties;
Execution monitoring by tracking what is happening to the described aspects of a service and its sub-services.
Service Classifications
Classifications of the descriptions act as:
an index to the descriptions;
a filtering mechanism;
a query containment mechanism;
a substitution mechanism such that partial or imprecise matches can be catered for; and
a clustering mechanism for similar services.
Problems with Profile
Three Steps to Discovering & Preparing a Service
Ontologies & Services
Typing
Controlling inputs and outputs of services.
Mapping from WSDL / OGSA XML Schema types to (DAML+OIL) concepts
Classifying
Indexing services and data
OGSA – factories and service instances.
Organising services based on reasoning over the service descriptions.
A simple single axial ontology describing sequence alignment services
Sequence alignment
  Pairwise
    SmithWaterman
    BLAST
      BLASTn
      BLASTp
      tBLASTn
  Multiple
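The alignment hierarchy above is enough to show how a classification supports discovery and substitution: asking for a general class returns every service registered under any of its subclasses. The registry entries (`svc-1` and so on) are invented, and placing SmithWaterman and BLAST under Pairwise is an assumed reading of the diagram.

```python
# Class -> direct superclass, following the alignment hierarchy above
PARENT = {
    "Pairwise": "Sequence alignment",
    "Multiple": "Sequence alignment",
    "SmithWaterman": "Pairwise",
    "BLAST": "Pairwise",
    "BLASTn": "BLAST",
    "BLASTp": "BLAST",
    "tBLASTn": "BLAST",
}

# Hypothetical registry: service name -> class it is described as
REGISTRY = {
    "svc-1": "BLASTp",
    "svc-2": "SmithWaterman",
    "svc-3": "tBLASTn",
}


def is_kind_of(cls, ancestor) -> bool:
    """True if cls equals ancestor or lies below it in the hierarchy."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = PARENT.get(cls)
    return False


def find_services(requested: str) -> list:
    """All registered services whose class is subsumed by the requested class."""
    return sorted(
        name for name, cls in REGISTRY.items() if is_kind_of(cls, requested)
    )


print(find_services("BLAST"))
print(find_services("Pairwise"))
print(find_services("Multiple"))
```

A request for `Pairwise` is satisfied by a Smith-Waterman or a BLAST service alike, which is the kind of imprecise match the classification is meant to cater for.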
What do ontologies offer?
Common framework for integration: OpenMMS, TAMBIS, ONION
Search support, querying & matching: GO, MGED, UMLS, MeSH
Intelligent interfaces for queries and data capture: Ingenuity web-based products, TAMBIS
Control + Semantics
Copyright © 1998 Pangea Systems, Inc. All rights reserved.
What is Knowledge?
Knowledge – all information and an understanding to carry out tasks and to infer new information
Information -- data equipped with meaning
Data -- un-interpreted signals that reach our senses
Protein kinase C
Michael Ashburner | Name | man
Professor | Job | academic, senior
University of Cambridge | Institution | ancient university, 5 rated
UK | Country | European
IGF | Conf | important figure in biology
BIOLOGY
EcoCyc
Gene Ontology http://www.geneontology.org
Controlled vocabulary
• AGROVOC: Agricultural Vocabulary
UMLS (Unified Medical Language System) http://umlsks.nlm.nih.gov/
• National Library of Medicine (NLM) database of medical terminology. Terms from several medical databases (MEDLINE, SNOMED International, Read Codes, etc.) are unified so that different terms are identified as the same medical concept.
• Metathesaurus provides the concordance of medical concepts: 730,000 concepts, 1.5 million concept names in different source vocabularies
• Specialist lexicon provides word synonyms, derivations, lexical variants, and grammatical forms of words used in Metathesaurus terms: 130,000 entries.
• Semantic Network codifies the relationships (e.g. causality, "is a", etc.) among medical terms: 134 semantic types, 54 relationships.
An Ontology Building Life-cycle
Identify purpose and scope
Knowledge acquisition
Evaluation
Language and representation
Available development tools
Conceptualisation
Integrating existing ontologies
Encoding
Building
Ontology Learning
Consistency Checking