Chemical named entity recognition and literature mark-up

Post on 07-Jan-2016

Colin Batchelor
Informatics Department
Royal Society of Chemistry
batchelorc@rsc.org

2

Overview

Project Prospect: what we find and how we find it.

RDF: How should we be disseminating it?

Next steps: Basics for a chemical ontology.

9

Project Prospect: What do we find?

Chemical compounds

Chemical terms from the IUPAC Gold Book

Gene products: function, process, location

Nucleotide and polypeptide sequence terms

Cell types

10

Project Prospect: How do we find it?

For compound names:

~60% Oscar (Corbett and Murray-Rust 2006; Batchelor and Corbett 2007)

~20% PubChem

~20% ChemDraw

For compound numbers:

~70% author ChemDraw

~30% editors

12

RDF in an RSS reader

13

RDF: how we do it now

Content module from RSS 1.0

http://web.resource.org/rss/1.0/modules/content

In what sense does an article “contain” pyridine or base pairs?

We would much rather have proper RDF predicates, e.g. “is_about”, “mentions”.
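As a rough sketch of what such predicates might look like, the snippet below writes one N-Triples statement linking an article to an entity it mentions. The `http://example.org/prospect#` namespace is hypothetical, and pairing "mentions" with this particular entity is an assumption for illustration; the slide names the predicates only as examples.

```python
# Hypothetical vocabulary namespace; not an actual RSC or RSS vocabulary.
VOCAB = "http://example.org/prospect#"


def triple(subject, predicate, obj):
    """Format one N-Triples statement; all three terms are treated as IRIs."""
    return "<%s> <%s%s> <%s> ." % (subject, VOCAB, predicate, obj)


# Article and Sequence Ontology term IRIs as they appear in the RSC feed example.
article = "http://xlink.rsc.org/?DOI=b716356h&RSS=1"
so_term = "http://purl.org/obo/owl/SO#SO:0000028"

print(triple(article, "mentions", so_term))
```

Unlike membership in a `content:items` bag, each statement here names the relationship explicitly, so "is_about" and "mentions" could be told apart by a consumer.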

14

RDF: what it looks like now

<item rdf:about="http://xlink.rsc.org/?DOI=b716356h&amp;RSS=1">
  <title> [… title] </title>
  <link>http://xlink.rsc.org/?DOI=b716356h&amp;RSS=1</link>
  <description> [… blah] </description>
  <content:encoded> [… human-readable stuff …] </content:encoded>
  [… dublin core stuff …]
  <content:items>
    <rdf:Bag>
      <rdf:li>
        <content:item rdf:about="info:inchi/InChI=1/C22H22NO4/c1-13-16-11-21(26-4)20(25-3)10-15(16)8-18-17-12-22(27-5)19(24-2)9-14(17)6-7-23(13)18/h6-12H,1-5H3/q+1"/>
      </rdf:li>
      <rdf:li>
        <content:item rdf:about="http://purl.org/obo/owl/SO#SO:0000028"/>
      </rdf:li>
    </rdf:Bag>
  </content:items>
</item>
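A consumer of such a feed could pull out the entity identifiers roughly as follows. This is a minimal sketch using Python's standard `xml.etree.ElementTree`; the wrapping `rdf:RDF` element and the pared-down `SAMPLE` feed are assumptions made to keep the example self-contained, and `entity_uris` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Namespaces for RDF, RSS 1.0 and the RSS 1.0 content module.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rss": "http://purl.org/rss/1.0/",
    "content": "http://purl.org/rss/1.0/modules/content/",
}

# Pared-down feed: only the parts of the item needed for the illustration.
SAMPLE = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns="http://purl.org/rss/1.0/"
    xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <item rdf:about="http://xlink.rsc.org/?DOI=b716356h&amp;RSS=1">
    <content:items>
      <rdf:Bag>
        <rdf:li>
          <content:item rdf:about="info:inchi/InChI=1/C22H22NO4/c1-13-16-11-21(26-4)20(25-3)10-15(16)8-18-17-12-22(27-5)19(24-2)9-14(17)6-7-23(13)18/h6-12H,1-5H3/q+1"/>
        </rdf:li>
        <rdf:li>
          <content:item rdf:about="http://purl.org/obo/owl/SO#SO:0000028"/>
        </rdf:li>
      </rdf:Bag>
    </content:items>
  </item>
</rdf:RDF>"""


def entity_uris(feed_xml):
    """Map each item's rdf:about to the rdf:about values of its content:item children."""
    about = "{%s}about" % NS["rdf"]
    root = ET.fromstring(feed_xml)
    result = {}
    for item in root.findall("rss:item", NS):
        result[item.get(about)] = [
            ci.get(about) for ci in item.findall(".//content:item", NS)
        ]
    return result


for article, entities in entity_uris(SAMPLE).items():
    print(article, len(entities))
```

The output is a flat list of identifiers per article, which illustrates the limitation raised earlier: the feed says that an entity is "contained" but not how the article relates to it.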

15

Basics for a chemical ontology

1. Unambiguous representation of objects of chemical discourse

2. Proper parthood relations

16

Basics for a chemical ontology: 1. Objects of chemical discourse

Must be able to represent and clearly distinguish:

Compounds

Classes of compound

Parts of molecules

Mixtures

Would be nice to have:

Disambiguation cues for the first three

17

Imidazole

18

An imidazole

19

The imidazole side-chain/group/ring

20

Can ChEBI handle this?

Imidazoles (!) (CHEBI:24780)

Imidazole (CHEBI:16069)

Imidazole ring: not yet

Imidazolyl group: not yet (but methyl, benzyl, etc.)

… and there are no disambiguation cues

21

Disambiguation

One Sense per Discourse (Gale et al. 1992)

… this doesn’t hold at all

One Sense per Collocation (Yarowsky 1993)

… matches our intuitions

22

Disambiguation: What a one sense per collocation feature set might look like

CLASS:

w(–1) = a, an, the, this

w(0) plural (bit of a cheat, as not a collocation)

PART:

w(–1) = bridging, terminal

w(+1) = backbone, bridge, chain, core, dyad, fluorophore, fragment, framework (and many more)

w(+1)w(+2) = “building block”, “protecting group”, “side chain”
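The feature set above could be turned into a toy rule-based tagger along these lines. This is only a sketch: checking PART cues before CLASS cues, falling back to a COMPOUND sense, and detecting plurals with a trailing "s" are all assumptions made here, not part of the slides, and `guess_sense` is a hypothetical name.

```python
# Cue words taken from the feature set; the real lists are longer ("and many more").
CLASS_PREV = {"a", "an", "the", "this"}
PART_PREV = {"bridging", "terminal"}
PART_NEXT = {"backbone", "bridge", "chain", "core", "dyad",
             "fluorophore", "fragment", "framework"}
PART_NEXT_BIGRAMS = {("building", "block"), ("protecting", "group"),
                     ("side", "chain")}


def guess_sense(tokens, i):
    """Guess the sense of the chemical name at tokens[i] from its neighbours."""
    def at(j):
        # Lower-cased token at position j, or None when j is out of range.
        return tokens[j].lower() if 0 <= j < len(tokens) else None

    prev, word, nxt, nxt2 = at(i - 1), at(i), at(i + 1), at(i + 2)
    # Assumption: PART cues win over CLASS cues ("the imidazole side chain").
    if prev in PART_PREV or nxt in PART_NEXT or (nxt, nxt2) in PART_NEXT_BIGRAMS:
        return "PART"
    # Assumption: a trailing "s" is a crude stand-in for "w(0) plural".
    if prev in CLASS_PREV or word.endswith("s"):
        return "CLASS"
    return "COMPOUND"  # assumed default when no cue fires


print(guess_sense("an imidazole was added".split(), 1))
print(guess_sense("the imidazole side chain".split(), 1))
print(guess_sense("imidazole was added".split(), 0))
```

A real system would more plausibly feed these cues to a statistical classifier as features than apply them as hard rules.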

23

Basics for a chemical ontology: 2. Parthood relations

Parthood in ChEBI means at least three things:

Is necessarily chemically part of:

carbonyl group part_of carbonyl compounds

24

Basics for a chemical ontology: 2. Parthood relations

Is possibly chemically part of:

Lead(2+) part_of lead diacetate

(most lead(2+) isn’t)

Electron part_of muonium (!)

25

Basics for a chemical ontology: 2. Parthood relations

Is part of a mixture

Kanamycin A part_of kanamycin

26

Basics for a chemical ontology: 2. Parthood relations

Solution 1: define relationships according to pattern: all instances of X have a relationship with some Y. (Smith et al., “Relations in biomedical ontologies”, 2005)

Carbonyl compound has_part carbonyl group

Lead diacetate has_part lead(2+) (?!)

Muonium has_part electron

Kanamycin has_part kanamycin A (?!)
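The all-some pattern can be checked mechanically against instance-level data. The sketch below uses made-up instance labels (three lead(2+) ions, only one of which sits in lead diacetate) to show why the direction of the class-level relation matters; `all_some` is a hypothetical helper name, and the data are illustrative only.

```python
def all_some(xs, ys, related):
    """True iff every instance in xs is related to at least one instance in ys.

    Note: vacuously True when xs is empty.
    """
    return all(any((x, y) in related for y in ys) for x in xs)


# Toy instance-level data (hypothetical labels, not real measurements).
lead_ions = {"Pb2+ #1", "Pb2+ #2", "Pb2+ #3"}
lead_diacetate_units = {"Pb(OAc)2 #1"}

# Instance-level has_part pairs (whole, part) and the inverse part_of pairs.
has_part = {("Pb(OAc)2 #1", "Pb2+ #1")}
part_of = {(part, whole) for whole, part in has_part}

# Every lead diacetate unit has some lead(2+) as a part ...
print(all_some(lead_diacetate_units, lead_ions, has_part))
# ... but not every lead(2+) ion is part of some lead diacetate unit.
print(all_some(lead_ions, lead_diacetate_units, part_of))
```

Under this reading, "lead diacetate has_part lead(2+)" can hold as a class-level statement while "lead(2+) part_of lead diacetate" cannot, which is the asymmetry the "most lead(2+) isn’t" remark points at.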

27

Basics for a chemical ontology: 2. Parthood relations

Solution 2 (for discussion): Distinguish molecular-level relationships from sample-level relationships

Carbonyl compound molecule has_part carbonyl substituent

Muonium atom has_part electron

Kanamycin has_component kanamycin A

Lead diacetate has_component lead(2+) (?!)

28

Open questions

How do we represent the relationship between named entities and documents?

How do we integrate ontologies and word-sense disambiguation?

What is the best way of distinguishing molecules and samples?

29

Acknowledgements

University of Cambridge: Peter Corbett

OBO Foundry: Chris Mungall (Berkeley), Barry Smith (Buffalo)

www.projectprospect.org
