ChEBI, text mining and ontological best practice
Colin Batchelor, Royal Society of Chemistry
2
What is text mining?
Marti Hearst, Berkeley: “Text Mining is the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources.”
Can ChEBI help?
3
Overview
Reasoning
ChEBI as dictionary
Regular polysemy in chemistry
Some possible solutions
4
Reasoning
5
Reasoning
Reasoning is using the logical structure of an ontology to automatically infer facts about the world which have not been explicitly added by a human being.
Computers have no real-world knowledge beyond what we tell them.
6
Logical structure: properties of relations
We only have time to look at transitivity and is_a.
Smith et al., “Relations in Biomedical Ontologies”, Genome Biol., 2005, 6, R46.
Relation   Transitive   Symmetric   Reflexive   Anti-symmetric
is_a       Yes          No          Yes         Yes
part_of    Yes          No          Yes         Yes
7
ChEBI’s is_a is not transitive (1)
If a relation R is transitive, then:
If a R b and b R c, then a R c.
glutathione is_a cofactor
cofactor is_a biological role
therefore glutathione is_a biological role
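The inference on this slide can be reproduced mechanically. The following is a minimal sketch, assuming a toy edge list copied from the slide rather than a real ChEBI extract; the function name `ancestors` is hypothetical. It follows is_a edges upwards, which is what a reasoner does when it treats is_a as transitive.

```python
from collections import defaultdict

# Toy is_a assertions copied from the slide; not a real ChEBI extract.
IS_A = [
    ("glutathione", "cofactor"),
    ("cofactor", "biological role"),
]

def ancestors(term, edges):
    """Return every term reachable from `term` by following is_a edges."""
    parents = defaultdict(set)
    for child, parent in edges:
        parents[child].add(parent)
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents[current]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

print(sorted(ancestors("glutathione", IS_A)))
# → ['biological role', 'cofactor']
```

The computer has no way to know that "glutathione is_a biological role" is nonsense; it simply follows the edges it was given.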
8
ChEBI’s is_a is not transitive (2)
water is_a amphiprotic solvent
amphiprotic solvent is_a protophilic solvent (*)
protophilic solvent is_a Brønsted base (*)
Brønsted base is_a base
base is_a biological role
therefore water is_a base
therefore water is_a biological role
* how come “protophilic solvent” and “Brønsted base” only have one child each?
9
ChEBI’s is_a is not transitive (3)
N-hydroxy-L-aspartic acid is_a hydroxamic acids
hydroxamic acids is_a organic functional classes
therefore N-hydroxy-L-aspartic acid is_a organic functional classes
10
is_a has many meanings!
1. An amount of a compound has a biological role: tris is_a buffer.*
2. An amount of a compound has an application: sodium dodecyl sulfate is_a detergent.*
3. A less-abstract type is an example of a more abstract type: propane is_a alkanes.
4. ?!: metals is_a atoms.*
* Not a property of a lone atom or molecule!
11
Computers need facts about the world, not about ChEBI curation
12
ChEBI as dictionary
13
Evaluating name–structure conversion with ChEBI
ChEBI release 37 (26 September 2007) contains 12688 annotated entities, of which 8486 have InChI strings.
We use OSCAR3 (oscar3-chem.sourceforge.net) for name–structure conversion.
We convert chebi.obo to an XML file, each paragraph containing either a ChEBI name or an IUPAC name.
The layered structure of the InChI lets us give partial credit for incomplete matches.
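A rough sketch of how layer-by-layer comparison can give partial credit, under stated assumptions: the function names and match labels below are hypothetical, the ordering of the checks is one possible choice and not necessarily the scoring used for the results that follow, and the parsing only handles the common case where each layer after the formula starts with a single lower-case letter. The example strings are the InChIs for L-alanine, racemic alanine and glycine.

```python
# Layer prefixes that carry stereochemistry: double-bond (b), tetrahedral (t, m)
# and the stereo-type flag (s).
STEREO_PREFIXES = ("b", "t", "m", "s")

def layers(inchi):
    """Split an InChI into layers, dropping the 'InChI=' tag and the version."""
    body = inchi.split("=", 1)[1]
    return body.split("/")[1:]

def without_fixed_h(parts):
    """Keep only the layers that come before the fixed-hydrogen (/f) layer."""
    kept = []
    for part in parts:
        if part.startswith("f"):
            break
        kept.append(part)
    return kept

def without_stereo(parts):
    """Drop every stereochemical layer."""
    return [p for p in parts if not p.startswith(STEREO_PREFIXES)]

def match_level(expected, found):
    """Return the strictest level at which two InChIs agree."""
    a, b = layers(expected), layers(found)
    if a == b:
        return "exact"
    if without_stereo(a) == without_stereo(b):
        return "ignoring stereo"
    if without_stereo(without_fixed_h(a)) == without_stereo(without_fixed_h(b)):
        return "ignoring stereo and fixed hydrogens"
    return "no match"

L_ALANINE = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
ALANINE = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)"
GLYCINE = "InChI=1S/C2H5NO2/c3-1-2(4)5/h1,3H2,(H,4,5)"

print(match_level(L_ALANINE, L_ALANINE))  # → exact
print(match_level(L_ALANINE, ALANINE))    # → ignoring stereo
print(match_level(L_ALANINE, GLYCINE))    # → no match
```

Because the version field is discarded, this sketch also treats standard and non-standard InChIs with identical layers as equal, which may or may not be desirable.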
14
Results: IUPAC names
Total 8447
Identified as chemical 8255 (97.73%)
With InChI (upper bound) 1810 (21.43%)
Matching InChI, disregarding fixed hydrogen layer 1734 (20.53%)
Matching InChI, disregarding stereo 1176
Matching InChI, exact (lower bound) 1174 (13.90%)
Not all of name matched 1024
Name identified as two or more separate names 974 (11.53%)
15
Results: ChEBI names
Total 8146
Identified as chemical 7173 (88.06%)
With InChI (upper bound) 1036 (12.72%)
Matching InChI, disregarding fixed hydrogen layer 953 (11.70%)
Matching InChI, disregarding stereo 637
Matching InChI, exact (lower bound) 628 (7.71%)
Not all of name matched 764
Name identified as two or more separate names 373 (4.58%)
16
Regular polysemy
17
Regular polysemy
… where words stand for multiple things in a consistent way.
Examples:
Brand names
Grinding
Figure–ground
Exact–class–part polysemy in chemistry
Peter Corbett, Colin Batchelor and Ann Copestake (2008), “Pyridines, pyridine and pyridine rings”, Proc. BERBMTM08 at LREC 2008, Marrakech, Morocco.
18
Regular polysemy
Brand names: “Learning to buy a Renault and talk to BMW”
Grinding: “The squirrel scampered down the path and kept stopping and looking at the officers to check they were behind”
vs. “[…] the trick was to serve squirrel fresh and not to leave it hanging like other game”
19
Regular polysemy
Figure–ground
Audrey Hepburn painted the door (figure)
Audrey Hepburn walked through the door (ground)
The Incredible Hulk walked through the door (ambiguous)
20
Methyl, the radical (exact)
21
Methyl, the group (part)
22
Can ChEBI handle methyl?
methyl group (CHEBI:32875) YES
methyl radical (CHEBI:29309) YES
23
Imidazole (exact)
24
An imidazole (class)
25
imidazole side-chain/group/ring (part)
26
Can ChEBI handle imidazole?
imidazoles (CHEBI:24780) YES
imidazole (CHEBI:16069) YES
imidazole ring not yet
imidazolyl group not yet
27
Mapping exact, class and part to entries in ChEBI
Tests:
1. Has InChI: exact
2. Name is plural: class
3. Ends in –yl, “group” or “residue”: part
Test 2 doesn’t work for applications or roles.
Test 3 is brittle.
I would much rather use the logical structure of the ontology.
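The three tests above can be sketched as a string heuristic, which also makes their weaknesses easy to see. This is a minimal sketch, assuming the tests are applied in the order listed; the function name `classify`, the crude "ends in s" plural check and the `has_inchi` flags passed in the examples are illustrative assumptions, not values looked up in ChEBI.

```python
# Suffixes for test 3; the leading space keeps "group" and "residue" as whole words.
PART_SUFFIXES = ("yl", " group", " residue")

def classify(name, has_inchi):
    """Guess whether a ChEBI name denotes an exact compound, a class or a part."""
    if has_inchi:                      # test 1: has an InChI, so an exact structure
        return "exact"
    if name.endswith("s"):             # test 2: plural name, so a class
        return "class"
    if name.endswith(PART_SUFFIXES):   # test 3: looks like a group or residue
        return "part"
    return "unknown"

# The has_inchi flags below are assumed for illustration.
print(classify("imidazole", True))       # → exact
print(classify("imidazoles", False))     # → class
print(classify("methyl group", False))   # → part
print(classify("buffer", False))         # → unknown
print(classify("biphenyl", False))       # → part
```

The last two calls show the problems noted on the slide: a role such as "buffer" is a class but is not plural, and a compound name such as "biphenyl" would be taken for a part whenever its structure is missing.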
28
Some possible solutions
29
Some possible solutions (1)
ChEBI must represent facts about the world rather than about itself.
Examples:
If unclassified compounds have a structure, they should be in the molecular structure tree rather than the unclassifieds tree.
“organic functional classes” is a tool for assigning nomenclature. No chemical compound is an “organic functional class”.
30
Some possible solutions (2)
ChEBI must distinguish between what is always true and what is only sometimes true.
Example: Replace some is_a relationships with has_biological_role and has_application.
We need ChEBI to represent parts of molecules that aren’t substituents. They should all be descendants of molecular part (a new term), as should amino acid residues and nucleoside residues.
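To illustrate the first example on this slide, here is a small sketch of what changes for a reasoner once role assertions carry their own relation. The triples are toy data: the relation names come from the slide, "alkanes is_a hydrocarbons" is added purely for illustration, and the function names `is_a_closure` and `roles` are hypothetical.

```python
# Toy typed assertions; not a real ChEBI extract.
TRIPLES = [
    ("glutathione", "has_biological_role", "cofactor"),
    ("cofactor", "is_a", "biological role"),
    ("propane", "is_a", "alkanes"),
    ("alkanes", "is_a", "hydrocarbons"),  # illustrative addition
]

def is_a_closure(term, triples):
    """Follow only is_a edges upwards from `term`."""
    parents = {}
    for subject, relation, obj in triples:
        if relation == "is_a":
            parents.setdefault(subject, set()).add(obj)
    seen, stack = set(), [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def roles(term, triples):
    """Roles asserted for `term`, plus the is_a ancestors of those roles."""
    found = set()
    for subject, relation, obj in triples:
        if subject == term and relation == "has_biological_role":
            found.add(obj)
            found |= is_a_closure(obj, triples)
    return found

print(sorted(is_a_closure("propane", TRIPLES)))      # → ['alkanes', 'hydrocarbons']
print(sorted(is_a_closure("glutathione", TRIPLES)))  # → []
print(sorted(roles("glutathione", TRIPLES)))         # → ['biological role', 'cofactor']
```

With the relations separated, transitive reasoning over is_a stays within the structural hierarchy, while the role information is still available to anyone who asks for it explicitly.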
31
Questions?