Post on 10-Sep-2020
Figure: number of compounds (axis 0 to 450) in Reaxys, SureChEMBL and IBM-SIIP, for the "SureChEMBL and IBM set" and the "SureChEMBL only set"; bar labels 64.2%, 55.7% and 54.5%.
Figure: number of compounds (axis 0 to 3000) in SciFinder, SureChEMBL and IBM-SIIP, for "All compounds" and "Biologically relevant"; bar labels 59.0%, 50.6%, 60.8 and 49.4.
Comparison of automated and manual patent chemistry extraction methods
Luca Bartek*, Stefan Senger, George Papadatos, Anna Gaulton
Poster sections: Introduction, Methods, Results, Discussion, Conclusion, References
Introduction
As new chemical entities are often first published in patents, and some new compounds may not even be featured elsewhere, patents have become an important source of information for researchers.
With more and more patents granted each year, it becomes increasingly difficult to extract the chemistry manually. There are automated options, including SureChEMBL, which is available via the Open PHACTS discovery platform. But how reliable are they compared to manually curated sources? We looked at the following use cases:
Use case 1: starting from a patent, find the compounds it contains (patent → compounds).
In the second comparison, we used the 1740 unique patent-compound pairs we had retrieved from Reaxys. We checked how many of these patent-compound pairs could also be found in SureChEMBL and IBM SIIP, respectively.
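A pair-level comparison of this kind can be sketched as a set intersection. The snippet below is only an illustration, not the procedure used for the poster: the function name `pair_coverage`, the patent numbers and the compound keys are all invented, and it assumes that compounds from different sources have already been mapped to a shared identifier.

```python
from typing import Iterable, Set, Tuple

# A pair is (patent number, compound identifier). Both are plain strings here;
# the identifier scheme is an assumption made for this sketch.
Pair = Tuple[str, str]


def pair_coverage(reference: Iterable[Pair], candidate: Iterable[Pair]) -> float:
    """Return the fraction of reference pairs that also occur in candidate."""
    ref: Set[Pair] = set(reference)
    if not ref:
        raise ValueError("reference set is empty")
    return len(ref & set(candidate)) / len(ref)


# Toy data, invented for illustration only.
reaxys_pairs = [("US-1", "KEY-A"), ("US-1", "KEY-B"), ("WO-2", "KEY-A"), ("EP-3", "KEY-C")]
automated_pairs = [("US-1", "KEY-A"), ("WO-2", "KEY-A"), ("US-9", "KEY-Z")]

coverage = pair_coverage(reaxys_pairs, automated_pairs)
print(f"{coverage:.1%} of the reference pairs were also found")
```

Pairs that occur only in the candidate source (here `("US-9", "KEY-Z")`) do not affect the result, because coverage is measured against the reference set alone.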
Another interesting question is the source of the compounds – whether they are present in the patent as text, structural depictions or Markush structures. When we compared the subset of WO patents for which images were recognised to those for which they were not, we found only a 7.7% increase in efficiency, which is less than we had expected. This could be because automated systems today have no way of recognising Markush structures, which are in fact very common in the patent literature.
In our binary comparison, we found that chemicals with a higher patent corpus count were much more likely to be found in either of the automatically created databases. This was in line with our expectations. However, one might argue that “unique” compounds – that is, those with a low corpus count – are more relevant.
We also found a vast difference between the success rates for one- and multi-component compounds. When only single-component structures were considered, the success rate of SureChEMBL was over 80%, while for compounds containing more than two components it was 0%.
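One way to break a success rate down by component count is sketched below. It assumes the structures are available as SMILES strings, where a dot separates disconnected components, and that no ring-closure bond spans a dot. The function names and the five records are invented for illustration; they are not the poster's data.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def component_count(smiles: str) -> int:
    """Count the dot-separated components of a SMILES string."""
    return len([part for part in smiles.split(".") if part])


def success_rate_by_components(records: Iterable[Tuple[str, bool]]) -> Dict[int, float]:
    """Map each component count to the fraction of its compounds that were found.

    records holds (smiles, found) tuples, where found says whether the
    compound was retrieved from the database under test.
    """
    totals: Dict[int, int] = defaultdict(int)
    hits: Dict[int, int] = defaultdict(int)
    for smiles, found in records:
        n = component_count(smiles)
        totals[n] += 1
        hits[n] += int(found)
    return {n: hits[n] / totals[n] for n in sorted(totals)}


# Toy records, invented for illustration only.
records = [
    ("CCO", True),
    ("c1ccccc1", True),
    ("CC(=O)O", False),
    ("CC(=O)[O-].[Na+]", True),
    ("[Na+].[Na+].[O-]C(=O)C([O-])=O", False),
]

rates = success_rate_by_components(records)
for n, rate in rates.items():
    print(f"{n} component(s): {rate:.0%} found")
```

A cheminformatics toolkit would give a more robust component count; the string split is used here only to keep the sketch free of dependencies.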
We noticed that the highest success rate was achieved with US patents, so we decided to extend the search to patent families to examine whether alternative patent numbers could improve the results. After retrieving all US, WO and EP patent family members of the patents retrieved from Reaxys (this was done using SureChEMBL), we found only a moderate increase in the success rate of both SureChEMBL and IBM SIIP.
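The family extension can be illustrated as follows: a reference pair counts as found if the compound is linked to the original patent number or to any of its family members. This is a minimal sketch under stated assumptions. The `family_members` mapping, the function names and all patent numbers are hypothetical, and how the family data are obtained is outside its scope.

```python
from typing import Dict, Iterable, Set, Tuple

Pair = Tuple[str, str]  # (patent number, compound identifier)


def found_via_family(pair: Pair, candidate: Set[Pair],
                     family_members: Dict[str, Set[str]]) -> bool:
    """True if the compound is linked to the patent or to any family member."""
    patent, compound = pair
    numbers = {patent} | family_members.get(patent, set())
    return any((number, compound) in candidate for number in numbers)


def family_coverage(reference: Iterable[Pair], candidate: Iterable[Pair],
                    family_members: Dict[str, Set[str]]) -> float:
    """Fraction of reference pairs found once family members are allowed."""
    ref = set(reference)
    cand = set(candidate)
    if not ref:
        raise ValueError("reference set is empty")
    hits = sum(found_via_family(pair, cand, family_members) for pair in ref)
    return hits / len(ref)


# Toy data, invented for illustration only.
reference = [("WO-1", "KEY-A"), ("US-2", "KEY-B"), ("EP-3", "KEY-C")]
candidate = [("US-1", "KEY-A"), ("US-2", "KEY-B")]
family_members = {"WO-1": {"US-1", "EP-1"}}  # hypothetical family of WO-1

direct = family_coverage(reference, candidate, {})  # exact patent numbers only
expanded = family_coverage(reference, candidate, family_members)
print(f"direct: {direct:.1%}, with families: {expanded:.1%}")
```

In the toy data the pair for WO-1 is recovered only through its US family member, which is the kind of gain the family search was meant to detect.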
On average, 50-66% of the content of the “gold standard” manually curated patent chemistry databases can also be found in automatically generated databases. The latter are also freely available; for example, SureChEMBL will soon be available through the Open PHACTS API (http://dev.openphacts.org). IBM SIIP is also freely available; however, it is a static database covering patents until 2010, whereas SureChEMBL is updated daily.
1. Senger et al., J. Cheminf., 2015, 7:49
2. Akhondi et al., PLoS One, 2014, 9:e107477
3. http://www.uspto.gov/
Figure: USPTO grants per year, 1963 to 2013 (axis 0 to 350,000).
Diagram: a PATENT linked to COMPOUND 1, COMPOUND 2 and COMPOUND 3; a COMPOUND linked to PATENT 1, PATENT 2 and PATENT 3.
Use case 2: starting from a compound, find the patents that mention it (compound → patents).
PATENT → COMPOUNDS
Workflow: for 46 patents (Akhondi et al.), compounds were retrieved from SciFinder (manually curated) and from SureChEMBL and IBM SIIP (automatically generated).
COMPOUND → PATENTS
Workflow: Maybridge Hitfinder Collection, # heavy atoms > 19, MW < 500 (9274 compounds) → Reaxys patents (for 543 compounds), at least 1 US, WO or EP patent → “Incorrect or ambiguous”, “SureChEMBL only”, “SureChEMBL & IBM” (438 compounds).
PATENT → COMPOUNDS
When stereochemistry was removed, the results somewhat improved, with SureChEMBL returning 64.9% of the SciFinder molecules. This number was 67.1% for the “Biologically annotated” subset.
COMPOUND → PATENTS
The first comparison we performed was of a binary nature – we looked at whether a compound was found at all in the SureChEMBL and IBM SIIP databases.
Of the 438 compounds, 67.1% were found in at least one of the two databases: 52.7% were found in both, 2.9% only in IBM SIIP and 11.4% only in SureChEMBL.
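The binary comparison splits the query compounds into three mutually exclusive groups (both databases, only one, only the other), and these should add up to the "at least one" figure. The reported values do so within rounding: 52.7% + 2.9% + 11.4% = 67.0%, against the stated 67.1%. The sketch below shows the bookkeeping on invented identifiers; the function name and the data are assumptions, not the poster's.

```python
from typing import Dict, Iterable


def binary_comparison(query: Iterable[str], db_a: Iterable[str],
                      db_b: Iterable[str]) -> Dict[str, float]:
    """Split the query compounds by presence in two databases (as fractions)."""
    q = set(query)
    if not q:
        raise ValueError("query set is empty")
    in_a = q & set(db_a)
    in_b = q & set(db_b)
    n = len(q)
    return {
        "both": len(in_a & in_b) / n,
        "only_a": len(in_a - in_b) / n,
        "only_b": len(in_b - in_a) / n,
        "either": len(in_a | in_b) / n,
    }


# Toy data, invented for illustration only: ten query compounds c0..c9.
query = [f"c{i}" for i in range(10)]
database_a = ["c0", "c1", "c2", "c3", "c4", "c5", "x1"]
database_b = ["c3", "c4", "c5", "c6", "x2"]

result = binary_comparison(query, database_a, database_b)
for label, fraction in result.items():
    print(f"{label}: {fraction:.1%}")
```

Entries outside the query set (`x1`, `x2`) are ignored, so the fractions always refer to the query compounds only.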
Figure: number of patent-compound pairs (axis 0 to 1800) in Reaxys, SureChEMBL and IBM-SIIP; bar labels 61.6% and 59.3%.
PATENT → COMPOUNDS
SureChEMBL's efficiency of about 60% would most likely seem low to a researcher who expects every single compound of interest to be extracted from each patent. This is why it is surprising that the coverage was not greatly increased for “Biologically annotated” molecules. But what compounds are of interest?
SureChEMBL returned nearly 5 times more compounds than SciFinder.
What is noise?
* luca.bartek1@gmail.com
The research leading to these results has received support from the Innovative Medicines Initiative Joint Undertaking under grant agreement n° 115191, resources of which are composed of financial contribution from the European Union's Seventh Framework Programme (FP7/2007-2013) and EFPIA companies’ in kind contribution.