The 250th ACS National Meeting (Boston, MA)
Programmatic Access to Chemical Information
in PubChem
Sunghwan Kim, Paul Thiessen, Evan E. Bolton and Stephen H. Bryant
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health
Acknowledgements
Stephen Bryant, Evan E. Bolton, Lewis Geer, Yanli Wang, Asta Gindulyte, Lianyi Han, Bo Yu,
Paul Thiessen*, Jian Zhang, Jiyao Wang, Renata Geer, Ben Shoemaker, Jane He, Jie Chen,
Tiejun Cheng, Gang Fu, Leonid Zaslavsky, Takako Takeda, Ming Hao, Amrita Roy Choudhury
The PubChem Team
PubChem depositors, users, and collaborators
Funded by the National Library of Medicine
PubChem (https://pubchem.ncbi.nlm.nih.gov)
A “public” repository of information on small molecules and their biological activities, developed and maintained by the U.S. National Institutes of Health (NIH).
Launched in 2004 as a part of the Molecular Libraries Roadmap initiatives.
A key resource of chemical information for researchers in cheminformatics, chemical biology, medicinal chemistry, and many other fields.
PubChem contains:
• >157 million substance descriptions
• >60 million unique chemical structures
• >229 million biological test results
• >1 million biological assays, covering ~10,000 unique protein sequence targets
(Arguably) the largest corpus of publicly available chemical information, from more than 340 data contributors.
Programmatic Access to PubChem
• Entrez Utilities (E-Utils)
• Power User Gateway (PUG)
• PUG-SOAP
• PUG-REST
• PubChemRDF REST interface
Reference: PUG-SOAP and PUG-REST: web services for programmatic access to chemical information in PubChem. S. Kim, P.A. Thiessen, E.E. Bolton, & S.H. Bryant. Nucleic Acids Res. 2015, 43(W1):W605-11.
Entrez
NCBI’s database search and retrieval system.
Integrates NCBI’s >40 databases into a tightly interlinked system.
Provides an integrated view of biomedical data and their relationships.
Entrez Utilities (E-Utilities or E-Utils)
A suite of tools that provides access to nearly all Entrez functionality, primarily through an XML-over-HTTP interface.
Not developed or maintained by PubChem.
http://www.ncbi.nlm.nih.gov/books/NBK25497/#_chapter2_The_Nine_Eutilities_in_Brief_
EInfo: database statistics
ESearch: text searches
EPost: UID uploads
ESummary: docsum download
EFetch: data record download
ELink: Entrez links
EGQuery: global query
ESpell: spelling suggestions
ECitMatch: batch citation search
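As a rough illustration of the "XML over HTTP" style, the sketch below only builds an ESearch URL; it does not send it. The base URL, the `pccompound` database name, and the `retmax` parameter are assumptions that are not stated on the slide, and the helper name `esearch_url` is made up for this example.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities (assumed here; not stated on the slide).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Build an ESearch URL that runs a text search against one Entrez database."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{EUTILS_BASE}/esearch.fcgi?{query}"


# Text search of the PubChem Compound database (assumed name: "pccompound").
print(esearch_url("pccompound", "aspirin"))
```

Fetching the resulting URL should return an XML document listing matching UIDs, which could then be passed to ESummary or EFetch.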
Suited for accessing text or numeric-fielded data.
No ability to handle complex data types specific to PubChem:
• Chemical structures
• Tabular bioactivity data
Power User Gateway (PUG)
Provides programmatic access to PubChem Services via a single common gateway interface (CGI), available at the URL:
http://pubchem.ncbi.nlm.nih.gov/pug/pug.cgi
Exchanges data through a relatively complex XML schema, over the Hypertext Transfer Protocol (HTTP).
Examples of PUG-enabled services are:
• Substance/Compound download
• BioAssay data download
• Structure standardization service
• Chemical structure search
• Score matrix service
• Identifier exchange service
Each PUG-supported service has its own input/output.
Different XML for different services, defined in:
• http://pubchem.ncbi.nlm.nih.gov/pug/pug.dtd
• http://pubchem.ncbi.nlm.nih.gov/pug/pug.xsd
PUG XML can also be used to import and export queries within supported service web pages.
Most PUG requests are queued, allowing users to submit a number of long-running tasks.
The initial PUG response contains a 64-bit identifier for the requested task, which must be used in any further communication with PUG concerning that task.
Upon request, PUG will check the status of a task given its identifier, returning the results of the task if completed or the status of the task if not completed.
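The submit-then-poll pattern described above can be sketched generically. This is not PUG's actual XML exchange: `check` is a hypothetical callable standing in for whatever sends the status request and parses the reply, and the polling interval and retry limit are arbitrary choices.

```python
import time
from typing import Callable, Optional


def wait_for_task(
    task_id: str,
    check: Callable[[str], Optional[str]],
    interval: float = 2.0,
    max_polls: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll check(task_id) until it returns a result instead of None."""
    for _ in range(max_polls):
        result = check(task_id)  # None means "task not completed yet"
        if result is not None:
            return result
        sleep(interval)
    raise TimeoutError(f"task {task_id} did not finish after {max_polls} polls")


# Stand-in for a real status request: "not completed" twice, then a result.
_answers = iter([None, None, "results ready"])
print(wait_for_task("12345", lambda _id: next(_answers), sleep=lambda s: None))
```

Passing `sleep` as a parameter keeps the loop testable without real delays; a real client would leave the default in place.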
PUG-SOAP
Uses the Simple Object Access Protocol (SOAP).
Offers much of the same functionality as PUG, broken down into simpler functions defined in the Web Services Description Language (WSDL; http://www.w3.org/TR/wsdl), and uses SOAP-formatted message envelopes for information exchange.
The WSDL for PUG-SOAP can be found at: http://pubchem.ncbi.nlm.nih.gov/pug_soap/pug_soap.cgi?wsdl.
Most suitable for SOAP-aware GUI workflow applications (Taverna and Pipeline Pilot) and programming languages (C, C++, C#, .NET, Perl, Python, and Java).
“Keys” in PUG-SOAP
Simple strings that store data objects:
• Structure keys for a single chemical structure
• List keys for a set of identifiers (SIDs, CIDs, AIDs)
• Assay keys for a set of rows and columns from an assay table
• Download keys for a download URL (usually FTP)
Used for exchanging data between the PUG-SOAP server and the client application:
• Avoids sending/receiving intermediate results
• Allows one to readily chain queries between different PubChem services
• Reduces bandwidth requirements
“Functions” in PUG-SOAP
Input functions: specify the input structure and ID list (synchronous).
Processing functions: perform a supported operation on the input (asynchronous).
Output functions: retrieve the results (synchronous).
Asynchronous functions in PUG-SOAP
SMILES (c1ccccc1) → InputStructure() → structure key (733…801)
Structure key → IdentitySearch() → list key (457…843); poll with GetOperationStatus()
List key → Download() → download key (397…976); poll with GetOperationStatus()
Download key → GetDownloadURL() → FTP URL (ftp://......)
PUG-REST
Representational State Transfer (REST)-style interface.
Simplified access route without the overhead of XML or SOAP envelopes
Access to data that are not accessible through other PUG Services.
Intended to handle short, synchronous requests (<30 seconds).
Conceptual Framework of a PUG-REST request
The user’s computer sends a request to the PUG-REST server at PubChem:
① INPUT: identifiers (CIDs, SIDs, AIDs)
② OPERATION: what to do with the identifiers
③ OUTPUT: results in a desired format
All necessary information is encoded into a one-line URL.
URL construction for a PUG-REST request
http://pubchem.ncbi.nlm.nih.gov/rest/pug/<INPUT>/<OPERATION>/<OUTPUT>[?OPTIONS]
Prolog (common to all PUG-REST requests): http://pubchem.ncbi.nlm.nih.gov/rest/pug
[?OPTIONS]: options specific to some operations
<INPUT>: specifies the identifiers of interest
• by identifiers
• by chemical name
• by chemical structure search
• by cross reference
• by listkey, ......
<OPERATION>: specifies what to do with the input
• get full records
• get molecular properties
• get synonyms or images
• get cross references
• many other operations
<OUTPUT>: specifies the desired output format
• XML, JSON, JSONP, ASNB, ASNT, PNG, SDF, CSV, TXT
The three parts are (mostly) independent of each other.
→ Many combinations are possible in a PUG-REST request.
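Because the three parts are assembled independently, URL construction can be sketched as simple string composition. The helper name `pug_rest_url` is made up for this illustration, and the `https` scheme in the prolog is an assumption (the slide shows `http`).

```python
from typing import Mapping, Optional
from urllib.parse import urlencode

# Prolog common to all PUG-REST requests (https scheme assumed).
PROLOG = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def pug_rest_url(
    input_spec: str,
    operation: str,
    output: str,
    options: Optional[Mapping[str, str]] = None,
) -> str:
    """Join <INPUT>/<OPERATION>/<OUTPUT> and append any operation-specific options."""
    url = f"{PROLOG}/{input_spec}/{operation}/{output}"
    if options:
        url += "?" + urlencode(options)
    return url


# Full 3-D records for CIDs 2244 and 1983, in XML.
print(pug_rest_url("compound/cid/2244,1983", "record", "XML", {"record_type": "3d"}))
```

The function does no validation, so an invalid combination of parts still produces a URL; the server decides whether the request makes sense.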
① http://...... /compound/cid/2244,1983/record/XML?record_type=3d
→ Retrieve in XML full records for CIDs 2244 and 1983, including 3-D structure description.
② http://....../compound/smiles/C(C(=O)O)N/property/TPSA,XLogP/CSV
→ Retrieve in CSV the TPSA and XLogP values for compounds whose SMILES string is “C(C(=O)O)N”.
TPSA: topological polar surface area
③ http://....../compound/name/lipitor/record/PNG?record_type=2d&image_size=large
→ Download the large image of the 2-D structure of Lipitor in PNG.
PNG: portable network graphics
④ http://....../substance/xref/PatentID/US6127355/sids/TXT
→ Retrieve in TXT the SIDs of substances that are mentioned in U.S. Patent US6127355.
⑤ http://....../assay/aid/640/cids/XML?cids_type=active
→ Retrieve in XML the CIDs of compounds that tested active in AID 640.
⑥ http://……/assay/aid/490,1000/targets/ProteinName,GeneSymbol/XML
→ Retrieve in XML the protein names and gene symbols of the targets of AIDs 490 and 1000.
⑦ http://……/assay/aid/504526/doseresponse/CSV?sid=104169547
→ Retrieve in CSV dose-response data for SID 104169547 tested in AID 504526.
⑧ http://....../assay/target/gi/66528677/concise/CSV
→ Retrieve in CSV a concise view of assays targeting protein GI 66528677 (glucocorticoid receptor isoform γ).
Search by chemical name
Exact match (compound whose name is aspirin)
→ Identical to “aspirin[CompleteSynonym]” in Entrez
http://....../compound/name/aspirin/cids/TXT?name_type=complete
Partial match (Compounds whose name contains aspirin)
→ Identical to “aspirin[Synonym]” in Entrez
http://....../compound/name/aspirin/cids/TXT?name_type=word
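A small sketch of how the two name-search modes might be selected from code. The helper name `name_to_cids_url` and the `https` prolog are assumptions; percent-encoding the name is added here so that names containing spaces still yield a valid URL, which the slide does not discuss.

```python
from urllib.parse import quote

# Prolog common to all PUG-REST requests (https scheme assumed).
PROLOG = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def name_to_cids_url(name: str, name_type: str = "complete") -> str:
    """Build a name-to-CID lookup URL; name_type is "complete" (exact) or "word" (partial)."""
    if name_type not in ("complete", "word"):
        raise ValueError("name_type must be 'complete' or 'word'")
    # Percent-encode the name so spaces and other characters are URL-safe.
    return f"{PROLOG}/compound/name/{quote(name, safe='')}/cids/TXT?name_type={name_type}"


print(name_to_cids_url("aspirin"))
print(name_to_cids_url("acetylsalicylic acid", name_type="word"))
```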
Requests that conflict with URL syntax
Multi-line SDF files
Chemical names, SMILES, and InChI strings with special characters reserved in the URL syntax, e.g. a forward slash (“/”):
• C(=C/F)\F
• InChI=1S/C2H2F2/c3-1-2-4/h1-2H/b2-1+
Very long lists of identifiers in the URL
→ Use HTTP POST
> curl -H 'Content-Type: application/x-www-form-urlencoded' -d "inchi=InChI=1S/C3H8/c1-3-2/h3H2,1-2H3" http://PROLOG/rest/pug/compound/inchi/cids/TXT
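The same POST request can be sketched with Python's standard library. The example only constructs the request object and prints it; the full endpoint URL is an assumption assembled from the URL template, and `inchi_to_cids_request` is a made-up helper name.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Endpoint assumed from the PUG-REST URL template (https scheme assumed).
ENDPOINT = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/inchi/cids/TXT"


def inchi_to_cids_request(inchi: str) -> Request:
    """Put the InChI in a form-encoded POST body so its slashes never appear in the URL path."""
    body = urlencode({"inchi": inchi}).encode("ascii")
    return Request(
        ENDPOINT,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


req = inchi_to_cids_request("InChI=1S/C3H8/c1-3-2/h3H2,1-2H3")
print(req.get_method(), req.full_url)
print(req.data.decode("ascii"))
# To actually send it: urllib.request.urlopen(req).read()
```

Unlike the curl command, `urlencode` percent-encodes the body; a form-decoding server should recover the same InChI string.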
Asynchronous jobs
A standard time limit of 30 seconds applies to each web service request.
Some tasks may take longer than 30 seconds, e.g. chemical structure searches, including:
• identity search
• substructure search
• superstructure search
• similarity search
• molecular formula search
→ These use an asynchronous approach based on list keys.
“Synchronous” alternatives are now available.
The result of any operation that returns a list of SIDs/CIDs/AIDs can be stored in a list key on the server side.
A list key can be retrieved by subsequent requests.
Helpful when chaining requests.
A list key expires after 8 hours of inactivity.
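Chaining through a list key might look like the sketch below. The JSON layout of the "waiting" response (a `Waiting` object holding a `ListKey`) is an assumption that should be checked against the PUG-REST documentation, the sample response is invented, and both helper names are hypothetical.

```python
import json
from typing import Optional

# Prolog common to all PUG-REST requests (https scheme assumed).
PROLOG = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def extract_listkey(response_text: str) -> Optional[str]:
    """Return the list key from a 'waiting' response, or None if there is none."""
    payload = json.loads(response_text)
    return payload.get("Waiting", {}).get("ListKey")  # assumed field names


def listkey_url(listkey: str, operation: str = "cids", output: str = "JSON") -> str:
    """Build a follow-up request that uses a stored list key as its <INPUT>."""
    return f"{PROLOG}/compound/listkey/{listkey}/{operation}/{output}"


# Invented response body, for illustration only.
sample = '{"Waiting": {"ListKey": "1234567890", "Message": "Your request is running"}}'
key = extract_listkey(sample)
if key is not None:
    print(listkey_url(key))
```

A client would keep requesting the list-key URL until the results arrive, and could reuse the same key as input to a further operation while it remains valid.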
Request volume limitations
PUG-REST is NOT designed for very large volumes of requests (e.g., millions of requests).
To avoid overloading the PubChem servers, no script or application should make more than five requests per second.
If you have a large data set to process, please contact us for help on optimizing your task.
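One simple way for a script to stay under the five-requests-per-second limit is to space requests at least 0.2 seconds apart. The `Throttle` class below is an illustrative sketch, not part of any PubChem client library; the clock and sleep functions are injectable only so that the logic can be tested without waiting.

```python
import time
from typing import Callable


class Throttle:
    """Space calls so that at most max_per_second of them start in any second."""

    def __init__(
        self,
        max_per_second: float = 5.0,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = 0.0

    def wait(self) -> float:
        """Block until the next request may be sent; return the seconds slept."""
        now = self._clock()
        delay = max(0.0, self._next_allowed - now)
        if delay > 0:
            self._sleep(delay)
        # The following request may start one interval after this one.
        self._next_allowed = max(now, self._next_allowed) + self.min_interval
        return delay


throttle = Throttle(max_per_second=5)
for cid in (2244, 1983, 3672):
    throttle.wait()
    # The actual request for this CID would be sent here.
    print("requesting CID", cid)
```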
Entrez Utilities: for accessing textual/numeric data.
Power User Gateway: pure XML-based interface; uses a complex PubChem-specific XML schema.
PUG-SOAP: uses the Simple Object Access Protocol; good for scripting/programming languages with a SOAP interface.
PUG-REST: Representational State Transfer-style interface; the simplest and easiest to use and learn.