InfoChem GmbH © 2012 Dr. Josef Eiblmaier ACS National Meeting, Philadelphia, August 19 - 22, 2012
1 / 35
Mission impossible? Computer Aided Extraction of Generic Chemical Structures from Patents.
A Critical Review of the Technologies Applied and Some Results of the Theseus-Project ‘ChemProspector’
Josef Eiblmaier, Valentina Eigner-Pitto, Hans Kraut, Larisa Isenko, Heinz Saller and Peter Loew
2 / 35
Outline
© cora / PIXELIO, www.pixelio.de
» Introduction
› ChemProspector, a THESEUS project
› Markush in a nutshell
» Goals and Approach
» Results
» Outlook
3 / 35
Introduction
4 / 35
» Research program initiated by the Federal Ministry of Economy and Technology (BMWi)
» ‘New Technologies for the Internet of Services’
» Supported with approx. 100 million Euros
» Duration: five years (2007 - 2011)
» Phase one: development of core technologies (2007 - 2008)
» Phase two: THESEUS SME (2009 - 2011)
5 / 35
ChemProspector: Basic Data
» Main emphasis:
‘The automatic extraction of Markush Structures from patent documents’
» Research SME-project within THESEUS research program
» Duration: July 2009 – end of 2011
6 / 35
What is a ‘Markush Structure’?
http://www.colorantshistory.org
» Dr. Eugene A. Markush (1888-1968), Pharma Chemical Corporation (1917)
» US Patent No. 1,506,316 (1924): first use of generic structures in a patent
7 / 35
Markush Structure Example
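A Markush structure combines a fixed core with variable groups, each defined by a list of alternatives, so one drawing can stand for many specific compounds. The following is a minimal sketch of that idea, not the example from the slide: the core (a para-disubstituted benzene written as a SMILES template), the substituent lists and the function name `enumerate_markush` are all made up for illustration.

```python
from itertools import product

# Hypothetical core: benzene ring with two variable positions, R1 and R2.
CORE = "{R1}c1ccc({R2})cc1"

# Hypothetical alternatives for each variable group (SMILES fragments).
R_GROUPS = {
    "R1": ["C", "CC", "CCC"],   # methyl, ethyl, n-propyl
    "R2": ["Cl", "N", "O"],     # chloro, amino, hydroxy
}

def enumerate_markush(core, r_groups):
    """Expand the core template with every combination of substituents."""
    names = list(r_groups)
    combos = product(*(r_groups[name] for name in names))
    return [core.format(**dict(zip(names, combo))) for combo in combos]

structures = enumerate_markush(CORE, R_GROUPS)
print(len(structures), structures[0])
```

Even this toy case covers 3 × 3 specific compounds; generic groups such as "alkyl" make the set open-ended, which is why real Markush structures cannot simply be enumerated.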
8 / 35
Approach
© Gerd Altmann / PIXELIO, www.pixelio.de
9 / 35
Basic assumptions
» Markush notations follow particular grammar rules
› Generic graphical core structure
› Definition of generic groups in the subsequent text
› Usage of ‘Markush specific’ phrases
» Markush-Structures can be categorized
› Level 1 (easy)
› Level 2 (medium)
› Level 3 (hard)
10 / 35
Challenges in Markush Structures
» The information is contained in the text …
› Substituent variation
› Homology variation
› Topology variation
» ... in the images ...
› Position variation
› Frequency variation
11 / 35
Challenges in Markush Structures
» ... and in both text and images
› Frequency variation
› Bond variation
12 / 35
Classification of Markush Structures
» Level 1: simple standard notations (easy)
› relevant information is in the image or directly in the subsequent text
› simple grammar rules are used
› all variable parts are clearly defined
› variable parts do not contain further nested variable groups; generic organic groups (e.g. alkyl) are allowed
» Level 2: complex standard notations (medium)
› relevant information is in the image or in the text; references to other places in the document must be clear
› more complex grammar may be used, provided it still follows defined rules
› generic groups may contain further nested generic groups, but must remain comprehensible
› conditional R-groups are allowed as long as they are clearly structured and unambiguous
» Level 3: complex notations, singletons (hard)
13 / 35
Main Components for Markush Recognition
» ICANNOTATOR
» Markush-Parser
» Image classifier
» Chemical recognition
14 / 35
Image Recognition: ICImg2Struct
Extracts chemical images
» Proprietary development
› started mid 2010
» Multi-step process:
› page segmentation
› image classification
› vectorization, OCR, reconstruction
» Components: image classifier, chemical recognition
15 / 35
Named Entity Recognition: ICANNOTATOR
Extracts chemical named entities
» Exact chemical entities
› methyl, ethyl, n-propyl, phenyl, chloro, nitro, amino, hydroxy, hydrogen, carbon, 1-naphthyl, 2-pyridyl, tosyl, piperidyl ...
» Generic and homology groups, fragments
› alkyl, alkoxy, aryl, halide, hydrocarbon ...
» Combinations
› alkylamino, 4-aryl-phenyl, ...
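As a rough illustration of the three entity categories, the sketch below tags terms by lexicon lookup. It is not how ICANNOTATOR works internally (the slides do not say); the lexicon is a subset of the example terms listed here, and the names `LEXICON`, `ENTITY_RE` and `tag_entities` are hypothetical.

```python
import re

# Hypothetical mini-lexicon built from a subset of the example terms on the slide.
LEXICON = {
    "EXACT": ["methyl", "ethyl", "n-propyl", "phenyl", "chloro", "nitro",
              "amino", "hydroxy", "1-naphthyl", "2-pyridyl", "tosyl"],
    "GENERIC": ["alkyl", "alkoxy", "aryl", "hydrocarbon"],
    "COMBINATION": ["alkylamino", "4-aryl-phenyl"],
}

TERM_TO_LABEL = {term: label for label, terms in LEXICON.items() for term in terms}

# Longest terms first, so "alkylamino" is preferred over "alkyl" or "amino".
_alternation = "|".join(
    re.escape(term) for term in sorted(TERM_TO_LABEL, key=len, reverse=True)
)
# The lookarounds stop "ethyl" from matching inside "methyl" or "aryl" inside "4-aryl-phenyl".
ENTITY_RE = re.compile(rf"(?<![\w-])(?:{_alternation})(?![\w-])", re.IGNORECASE)

def tag_entities(text):
    """Return (matched text, category) pairs in order of appearance."""
    return [(m.group(0), TERM_TO_LABEL[m.group(0).lower()])
            for m in ENTITY_RE.finditer(text)]

print(tag_entities("R1 is methyl, alkylamino or aryl"))
```

A pure lexicon cannot recognize unseen names, so a production recognizer would need name grammars or a trained model on top of this.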
16 / 35
Markush-Parser
Extracts Markush-specific entities
» Formula definitions
› formula 1, general formula (I), derivatives represented by (3), …
» Variable definitions
› R, R1, R2, R', A, X, Y, Z, Ar, …
» Wherein definitions
› where, wherein, in which, …
» Link group
› represents, may be, one of, is selected from, …
» Chain lengths
› 3-20 carbon atoms, …
» Topological definitions
› branched or unbranched, …
» Bond types
› may contain double bonds, …
» References
› as defined above, …
» Substitutions
› optionally substituted by, …
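The entity types listed for the Markush-Parser lend themselves to pattern matching. The sketch below assumes one regular expression per entity type, written only from the example phrases on this slide; the label names, the regexes and `find_markush_entities` are illustrative and are not the project's actual patterns.

```python
import re

# Hypothetical labels and regexes, derived only from the example phrases above.
MARKUSH_PATTERNS = {
    "FORMULA_DEFINITION": re.compile(
        r"\b(?:[Gg]eneral\s+)?[Ff]ormula\s+\(?(?:[IVX]+|\d+)\)?(?!\w)"),
    # Case-sensitive on purpose: variables are upper-case symbols.
    "VARIABLE": re.compile(r"\b(?:R\d*'?|Ar|[AXYZ])(?![\w'])"),
    "WHEREIN": re.compile(r"\b(?:wherein|where|in which)\b", re.I),
    "LINK_GROUP": re.compile(
        r"\b(?:represents|may be|one of|is selected from)\b", re.I),
    "CHAIN_LENGTH": re.compile(
        r"\b\d+\s*(?:[-–]|to)\s*\d+\s+carbon\s+atoms\b", re.I),
    "TOPOLOGY": re.compile(
        r"\b(?:branched\s+or\s+unbranched|unbranched|branched)\b", re.I),
    "BOND_TYPE": re.compile(r"\bmay contain double bonds\b", re.I),
    "REFERENCE": re.compile(r"\bas defined above\b", re.I),
    "SUBSTITUTION": re.compile(r"\boptionally substituted (?:by|with)\b", re.I),
}

def find_markush_entities(text):
    """Return (start offset, label, matched text) tuples sorted by position."""
    hits = [(m.start(), label, m.group(0))
            for label, rx in MARKUSH_PATTERNS.items()
            for m in rx.finditer(text)]
    return sorted(hits)

claim = ("Compounds of general formula (I), wherein R1 is selected from alkyl "
         "having 3-20 carbon atoms, branched or unbranched, "
         "optionally substituted by halogen.")
for start, label, matched in find_markush_entities(claim):
    print(f"{start:3d}  {label:18s}  {matched}")
```

Patterns this simple over-match (a sentence-initial "A" would be tagged as a variable), so context is needed to disambiguate them.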
17 / 35
SemanticParser
Finds patterns of entities and reassembles the components
» Grammar rules (rule 1, rule 2, rule 3, … rule n) operate on sequences of entity types:
› FORMULA DEFINITION
› CHEMICAL STRUCTURE
› WHEREIN DEFINITION
› LIGAND LIST
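One way to picture "finding patterns of entities" is to treat each grammar rule as a sequence of entity labels and look for that sequence in the stream of recognized entities. The sketch below does exactly that. The two rules are invented for illustration (the actual rule set is not given in the slides), and `Entity`, `RULES` and `match_rules` are hypothetical names.

```python
from typing import NamedTuple

class Entity(NamedTuple):
    label: str   # entity type assigned by an upstream recognizer
    text: str    # the text (or image reference) it covers

# Hypothetical rules: each is a sequence of entity labels to be found in order.
RULES = {
    "rule_1": ("FORMULA_DEFINITION", "CHEMICAL_STRUCTURE",
               "WHEREIN_DEFINITION", "LIGAND_LIST"),
    "rule_2": ("CHEMICAL_STRUCTURE", "WHEREIN_DEFINITION", "LIGAND_LIST"),
}

def match_rules(entities, rules=RULES):
    """Return (rule name, start index, matched entities) for each contiguous match."""
    labels = [e.label for e in entities]
    hits = []
    for name, pattern in rules.items():
        width = len(pattern)
        for i in range(len(labels) - width + 1):
            if tuple(labels[i:i + width]) == pattern:
                hits.append((name, i, entities[i:i + width]))
    return hits

stream = [
    Entity("FORMULA_DEFINITION", "general formula (I)"),
    Entity("CHEMICAL_STRUCTURE", "<image 1>"),
    Entity("WHEREIN_DEFINITION", "wherein"),
    Entity("LIGAND_LIST", "R1 is methyl or ethyl"),
]
for name, start, matched in match_rules(stream):
    print(name, start, [e.label for e in matched])
```

Fixed label sequences are the simplest case; rules with optional or repeated parts would need a proper grammar formalism rather than tuple comparison.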
18 / 35
SemanticParser Grammar Rules
» 135 patterns (regular expressions)
» 102 macros (sequences)
» 291 rules (Backus-Naur-Form)
19 / 35
Overall Workflow Approach
» ICANNOTATOR
» Markush-Parser
» Image classifier
» Chemical recognition
» Page segmentation
» OCR
» SemanticParser
20 / 35
Results
© G. Altmann / PIXELIO, www.pixelio.de
21 / 35
Results: Image Recognition
» Accuracy of the ICImg2Struct output against the original image, three examples: 100%, 100%, 96.3%
22 / 35
Image Recognition: More Challenges
» Wavy bonds
» Atom numbers
» Brackets
» Circles
» Fused characters
» Charges
» Crossing bonds
» Variable bonds
» …
23 / 35
Evaluation: Key Questions
» What is the distribution of Level 1, 2 and 3 within a defined test corpus?
» How many Markush-Structures are identified?
» How many correct core structures are identified?
» How many totally correct Markush-Structures are extracted?
© Rainer Sturm / PIXELIO, www.pixelio.de
24 / 35
Evaluation: Test Set of Documents
» Random selection of 100 patent documents
» Manual abstraction of the Markush-Structures contained therein
» Level 1: 474 Markush-Structures
» Level 2: 453 Markush-Structures
» Automatic comparison
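The automatic comparison can be pictured as matching each manually abstracted structure against the extracted one and counting agreement at increasing strictness, mirroring the key questions (identified, correct core, totally correct). The sketch below assumes a made-up record layout: `struct_id`, `core` and `rgroups` are hypothetical fields, not the project's ICF format, and the sample data is invented.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Markush:
    struct_id: str        # hypothetical key: document plus position in document
    core: str             # hypothetical canonical string for the core structure
    rgroups: frozenset    # hypothetical set of (variable, definition) pairs

def evaluate(reference, extracted):
    """Count identified, correct-core and fully correct structures."""
    found = {m.struct_id: m for m in extracted}
    counts = {"total": len(reference), "identified": 0,
              "correct_core": 0, "fully_correct": 0}
    for ref in reference:
        hit = found.get(ref.struct_id)
        if hit is None:
            continue                      # structure was not found at all
        counts["identified"] += 1
        if hit.core == ref.core:
            counts["correct_core"] += 1
            if hit.rgroups == ref.rgroups:
                counts["fully_correct"] += 1
    return counts

# Invented sample data, for illustration only.
manual = [
    Markush("doc1#1", "coreA", frozenset({("R1", "alkyl")})),
    Markush("doc1#2", "coreB", frozenset({("R1", "aryl"), ("X", "halogen")})),
    Markush("doc2#1", "coreC", frozenset({("R1", "methyl")})),
]
automatic = [
    Markush("doc1#1", "coreA", frozenset({("R1", "alkyl")})),
    Markush("doc1#2", "coreB", frozenset({("R1", "aryl")})),
]
print(evaluate(manual, automatic))
```

Exact equality is the strictest possible criterion; it corresponds to the "totally correct" question, while the looser counters show where partial recognition fails.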
25 / 35
Results: Document Classification
26 / 35
Results: Output ICF Proprietary Format
27 / 35
Example 2: Incomplete Recognition
28 / 35
Example 3: Full Recognition
29 / 35
Example 4: Full Recognition
30 / 35
Evaluation: Results
31 / 35
Conclusions
Mission impossible?
Not impossible …
… but not accomplished yet!
32 / 35
Outlook
» Extension of grammar rules for Markush Structures
» Further development of CDX extraction
» Further development of ICImg2Struct
33 / 35
Acknowledgements
» InfoChem ChemProspector team
» German Federal Ministry of Economy and Technology (BMWi)
34 / 35
© P. Storz / PIXELIO, www.pixelio.de
Thank you!
35 / 35
Questions?