Chemical structure representation in PubChem

Roger Sayle

NextMove Software, Cambridge, UK

252nd ACS National Meeting, Philadelphia, PA, Tuesday 23rd August 2016

Selected PubChem publications

• Sunghwan Kim, Paul A. Thiessen, Evan E. Bolton, Jie Chen, Gang Fu, Asta Gindulyte, Lianyi Han, Jane He, Siqian He, Benjamin A. Shoemaker, Jiyao Wang, Bo Yu, Jian Zhang and Stephen H. Bryant, “PubChem Substance and Compound Databases”, Nucleic Acids Research, 2015.

• Volker D. Hähnke, Evan E. Bolton and Stephen H. Bryant, “PubChem atom environments”, Journal of Cheminformatics, 7:41, 2015.

• Evan E. Bolton, Yanli Wang, Paul A. Thiessen and Stephen H. Bryant, “PubChem: Integrated Platform of Small Molecules and Biological Activities”, Annual Reports in Computational Chemistry, Volume 4, Chapter 12, pp. 217-241, 2008.

Substance and compound

• A unique and invaluable feature of PubChem’s architecture is the distinction between the deposited structures (substances) and the normalized structures (compounds), and the retention of both.

• PubChem Substance contains ~209.6M structures.

• PubChem Compound contains ~91.7M structures.

Molecular identity

• When are two chemical structures the same?

– Alternate chemical representations.

– Aromaticity and conjugation.

– Protonation states and tautomerism.

– Errors and typographical mistakes.

PubChem standardization service

https://pubchem.ncbi.nlm.nih.gov/standardize/standardize.cgi

Example 1: ethanol

• PubChem CID 702 has been deposited 1569 times with six different explicit atom counts.

– 1311 have 9 atoms and 8 bonds.

– 249 have 3 atoms and 2 bonds.

– 4 have 0 atoms and 0 bonds.

– 2 have 4 atoms and 3 bonds.

– 2 have 5 atoms and 4 bonds.

– 1 has 7 atoms and 6 bonds.

• All have the same SMILES (“CCO”) and InChI.
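
The counts above are consistent with records that differ only in how many of ethanol's six hydrogens are drawn explicitly. A minimal sketch of that reading, assuming each explicit hydrogen adds exactly one atom and one bond to the three-atom, two-bond C-C-O skeleton (this interpretation is an inference from the numbers, not something the slide states):

```python
# Deposition counts from the slide, keyed by (explicit atoms, explicit bonds).
DEPOSITIONS = {
    (9, 8): 1311,
    (3, 2): 249,
    (0, 0): 4,
    (4, 3): 2,
    (5, 4): 2,
    (7, 6): 1,
}

HEAVY_ATOMS, HEAVY_BONDS, MAX_H = 3, 2, 6  # C-C-O skeleton of ethanol, C2H6O


def explicit_hydrogens(atoms, bonds):
    """Return how many explicit H atoms explain the counts, or None if none do."""
    extra = atoms - HEAVY_ATOMS
    if 0 <= extra <= MAX_H and bonds - HEAVY_BONDS == extra:
        return extra
    return None


total = sum(DEPOSITIONS.values())
explained = {key: explicit_hydrogens(*key) for key in DEPOSITIONS}

print(total)      # 1569, the total quoted on the slide
print(explained)  # only the empty (0, 0) records are left unexplained
```

Under this assumption the 9-atom records carry all six hydrogens, the 3-atom records carry none, and the remaining records are partially hydrogen-suppressed.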

Explicit vs. implicit hydrogens

Example 2: nitrobenzene

• PubChem CID 7416 has 164 distinct substance depositions (2 without structures).

MDL molfile-ageddon

• Biovia 2017 changed the interpretation of CT files.

• This affects 342,689 SIDs and 213,097 CIDs.

Hydrogens: easy come/easy go?

• PubChem is inconsistent on protonation/hydrogens.

• Common organic element radicals are hydrogenated:

– [C] → C, [Cl] → Cl, [P] → P, [S] → S, [H] → [HH]

– [Li], [Be], [B], [Si], [As], [Se], [At], etc. remain unchanged.

• Some groups get deprotonated:

– c1ccccc1[N+](=O)O → c1ccccc1[N+](=O)[O-]

• But generally the protonation state is preserved:

– CC(=O)O, CC(=O)[O-], [NH4+], [NH3+]CC(=O)[O-]

– C[N+](C)(C)O
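
The single-atom rules on this slide can be written down as a small lookup table. The sketch below only restates the listed examples; the function name is hypothetical and this is not PubChem's implementation, which is not shown in the talk.

```python
# Bare-atom rewrites listed on the slide: these radicals gain hydrogens.
# In SMILES, an unbracketed "C" means methane, "Cl" means HCl, "P" means
# phosphine and "S" means hydrogen sulfide.
HYDROGENATED = {"[C]": "C", "[Cl]": "Cl", "[P]": "P", "[S]": "S", "[H]": "[HH]"}

# Bare atoms the slide lists as left untouched.
UNCHANGED = ("[Li]", "[Be]", "[B]", "[Si]", "[As]", "[Se]", "[At]")


def standardize_bare_atom(smiles):
    """Apply the slide's single-atom rules; anything unlisted passes through."""
    return HYDROGENATED.get(smiles, smiles)


for query in ("[C]", "[Si]", "[H]"):
    print(query, "->", standardize_bare_atom(query))
```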

Example 3: o-xylene

• A major challenge in chemical databases is aromaticity: recognizing that two structures that differ only in their Kekulé forms are the same molecule.

CID 7237

PubChem canonical Kekulé SMILES

• A significant innovation in cheminformatics was Evan Bolton’s development of a “canonical” Kekulé SMILES form of a molecule.

• Different chemistry toolkits (and chemists!) differ in opinion on which ring systems are aromatic and which are not, hence PubChem’s wish to remain “neutral” by only providing non-aromatic SMILES.
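
To make the idea of choosing one Kekulé form concrete: the Kekulé structures of a ring correspond to the perfect matchings of its bonds, and a canonical form is one of them picked by a deterministic rule. The sketch below enumerates both forms for the ring of o-xylene. It is not Bolton's algorithm (its steps are not in this transcript); the atom numbering and the "smallest sorted bond list" tie-break are assumptions made purely for illustration.

```python
def perfect_matchings(atoms, bonds):
    """Yield every way to pair up all atoms using the given bonds.

    Each pairing is one Kekulé structure: the paired bonds are drawn double.
    """
    if not atoms:
        yield ()
        return
    first = min(atoms)
    for a, b in bonds:
        if first in (a, b):
            partner = b if a == first else a
            if partner in atoms:
                rest = atoms - {first, partner}
                for tail in perfect_matchings(rest, bonds):
                    yield ((a, b),) + tail


# Aromatic ring of o-xylene; ring atoms 0 and 1 carry the methyl groups.
RING_BONDS = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (0, 5)]

kekule_forms = sorted(
    tuple(sorted(m)) for m in perfect_matchings(frozenset(range(6)), RING_BONDS)
)
canonical = kekule_forms[0]  # hypothetical tie-break: smallest sorted bond list

for form in kekule_forms:
    print(form)
```

The two forms differ in whether the bond between the methyl-bearing carbons (0, 1) is drawn double, which is exactly the difference a database has to see through.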

Bolton’s algorithm

• Steps of Bolton’s Canonical Kekulé Form Algorithm:

Tricky case: 10b,10c-dihydropyrene

• An important aspect is to aromatize all conjugated cycles, not just those in the SSSR (smallest set of smallest rings).

• Unfortunately, this computationally demanding requirement is a source of pain at the NCBI.
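
A small sketch of why this molecule is tricky, assuming the usual carbon skeleton of 10b,10c-dihydropyrene (a 14-atom perimeter plus two interior sp3 carbons; the atom numbering here is arbitrary). It enumerates every simple cycle by brute force, which is only practical for tiny graphs and hints at why the general requirement is expensive.

```python
def simple_cycles(adj):
    """Return every simple cycle of a small undirected graph as a set of bonds."""
    found = set()

    def walk(start, node, path, seen):
        for nxt in adj[node]:
            if nxt == start and len(path) >= 3:
                bonds = zip(path, path[1:] + [start])
                found.add(frozenset(frozenset(b) for b in bonds))
            elif nxt > start and nxt not in seen:
                walk(start, nxt, path + [nxt], seen | {nxt})

    for start in adj:
        walk(start, start, [start], {start})
    return found


# Perimeter atoms are 0-13; the interior sp3 carbons are numbered 14 and 15.
BONDS = [(i, (i + 1) % 14) for i in range(14)]
BONDS += [(14, 0), (14, 4), (15, 7), (15, 11), (14, 15)]
SP3 = {14, 15}

adj = {atom: set() for atom in range(16)}
for a, b in BONDS:
    adj[a].add(b)
    adj[b].add(a)

cycles = [set().union(*cycle) for cycle in simple_cycles(adj)]
six_rings = [c for c in cycles if len(c) == 6]
conjugated = [c for c in cycles if not c & SP3]

print(len(six_rings), all(ring & SP3 for ring in six_rings))
print([len(c) for c in conjugated])
```

Every six-membered ring passes through an sp3 carbon, so a ring-by-ring test over the smallest rings finds nothing to aromatize; the only fully conjugated cycle is the 14-atom perimeter.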

Conjugated ring systems

• Does it make sense to distinguish 4n+2 Hückel aromaticity from conjugated ring systems?
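
For reference, the 4n+2 test itself is a one-line check on the π-electron count. A minimal sketch (the function name is hypothetical, and counting the π electrons of a real ring system is the hard part that this leaves out):

```python
def is_huckel_count(pi_electrons):
    """True when the electron count has the form 4n + 2 for an integer n >= 0."""
    return pi_electrons >= 2 and (pi_electrons - 2) % 4 == 0


for name, electrons in [
    ("benzene", 6),
    ("cyclobutadiene", 4),
    ("[14]annulene perimeter", 14),
]:
    print(name, electrons, is_huckel_count(electrons))
```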

Resonance forms

• CCN(=O)=O → CC[N+](=O)[O-]

• CCN=N#N → CCN=[N+]=[N-]

• CC[O+]=C=[N-] → CCOC#N

• C[P+](C)(C)[O-] → CP(=O)(C)C

• CC(=[NH2+])[O-] → CC(=O)N

• CS(=[OH+])(=O)[O-]

• C[S+2]([O-])([O-])C
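
One invariant that every rewrite above respects is the net formal charge: charges move between atoms but their sum does not change. The sketch below checks this with a regular expression over bracket atoms. It is a sanity check on the listed pairs, not a SMILES parser, and it ignores features such as atom classes.

```python
import re

# Bracket atom carrying a charge, e.g. [O-], [NH2+], [S+2].
CHARGED_ATOM = re.compile(r"\[[^\]]*?([+-]+)(\d*)\]")


def net_charge(smiles):
    """Sum the formal charges written in a SMILES string's bracket atoms."""
    total = 0
    for signs, digits in CHARGED_ATOM.findall(smiles):
        size = int(digits) if digits else len(signs)
        total += size if signs[0] == "+" else -size
    return total


# Input -> output pairs taken from the slide.
REWRITES = [
    ("CCN(=O)=O", "CC[N+](=O)[O-]"),
    ("CCN=N#N", "CCN=[N+]=[N-]"),
    ("CC[O+]=C=[N-]", "CCOC#N"),
    ("C[P+](C)(C)[O-]", "CP(=O)(C)C"),
    ("CC(=[NH2+])[O-]", "CC(=O)N"),
]

for before, after in REWRITES:
    print(before, net_charge(before), "->", after, net_charge(after))
```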

Tautomers are normalized

• CC(=N)O → CC(=O)N

• CC(=[NH2+])[O-] → CC(=O)N

• n1ccccc1O → [nH]1ccccc1=O

• n1ccc(O)cc1 → [nH]1ccc(=O)cc1

Classic tautomerism: Laar 1886

InChI=1S/C16H12N2O/c19-16-11-10-15(13-8-4-5-9-14(13)16)18-17-12-6-2-1-3-7-12/h1-11,19H

InChI=1S/C16H12N2O/c19-16-11-10-15(13-8-4-5-9-14(13)16)18-17-12-6-2-1-3-7-12/h1-11,17H

CID 5355205 (CAS 3651-02-3)

5 SIDs 13 SIDs
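
The two InChIs for this pair can be compared layer by layer to see where they part ways. A minimal sketch, assuming the formula reads C16H12N2O and that each layer after the formula is identified by its first character:

```python
def inchi_layers(inchi):
    """Split an InChI string into its version, formula and prefixed layers."""
    version, formula, *rest = inchi.split("/")
    layers = {"version": version, "formula": formula}
    for layer in rest:
        layers[layer[0]] = layer[1:]  # first character names the layer
    return layers


SKELETON = (
    "InChI=1S/C16H12N2O/"
    "c19-16-11-10-15(13-8-4-5-9-14(13)16)18-17-12-6-2-1-3-7-12"
)
H_ON_ATOM_19 = SKELETON + "/h1-11,19H"  # mobile hydrogen on atom 19 (the oxygen)
H_ON_ATOM_17 = SKELETON + "/h1-11,17H"  # mobile hydrogen on atom 17 (a nitrogen)

a, b = inchi_layers(H_ON_ATOM_19), inchi_layers(H_ON_ATOM_17)
differing = sorted(key for key in a if a[key] != b.get(key))
print(differing)  # ['h']
```

Formula and connectivity agree; only the hydrogen layer differs, so the two records describe the same skeleton with one hydrogen placed on a different atom.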

But things could be improved...

Bonds to metals

• PubChem follows InChI in breaking bonds to metals.

– Table salt

    • [Na]Cl → [Na+].[Cl-]

    • [Na].[Cl] → [Na].Cl

– Zirconium(IV) ethoxide

    • CCO[Zr](OCC)(OCC)OCC → [Zr].CCO.CCO.CCO.CCO

    • [Zr+4].CC[O-].CC[O-].CC[O-].CC[O-]

– Grignard reagents

    • c1ccccc1[Mg]Br → c1cccc[c-]1.[Mg+2].[Br-]

    • c1ccccc1[Mg+].[Br-] → c1cccc[c-]1.[Mg+].[Br-]

Periodic table (circa 1997-2003)

• PubChem currently handles 109 of the 118 elements in the periodic table [to be ratified in 2016].

• Hence “Mt” is the heaviest element at the moment.

• “Ds”, “Rg”, “Cn”, “Fl”, “Lv” already ratified.

• “Nh”, “Mc”, “Ts” and “Og” expected soon.
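
The element counts on this slide can be checked directly. In the sketch below the atomic numbers are standard values, while `is_supported` is a hypothetical helper that only mirrors the 109-element cut-off described above.

```python
# Elements beyond Mt, grouped as on the slide, with their atomic numbers.
ALREADY_RATIFIED = {"Ds": 110, "Rg": 111, "Cn": 112, "Fl": 114, "Lv": 116}
EXPECTED_SOON = {"Nh": 113, "Mc": 115, "Ts": 117, "Og": 118}

HEAVIEST_SUPPORTED = 109  # meitnerium, "Mt"


def is_supported(atomic_number):
    """Hypothetical check mirroring the slide's 109-element cut-off."""
    return 1 <= atomic_number <= HEAVIEST_SUPPORTED


missing = {**ALREADY_RATIFIED, **EXPECTED_SOON}
print(HEAVIEST_SUPPORTED + len(missing))  # 118
print(sorted(missing, key=missing.get))   # symbols for elements 110 to 118
```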

PubChem isotopes

• PubChem registration confirms that any specified isotope has been observed experimentally.

• Hence [7CH4] is rejected, but [8CH4] is allowed.

• Interestingly, the [8CH4] of CID 11635947 has a half-life of only two zeptoseconds (2×10⁻²¹ seconds).

• Another quirk is that PubChem doesn’t normalize isotope labels on mononuclidic elements. Hence [19F]C (CID 58338844) is the same as FC (CID 11638).
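
Both isotope behaviours can be sketched with a small table. The lower bound for carbon comes from the slide; the table layout, the function names and the clean-up step are assumptions for illustration. The second function shows the normalization that the slide notes PubChem does not perform.

```python
import re

# Lightest accepted mass number per element. Only carbon is filled in,
# from the slide: [7CH4] is rejected and [8CH4] is accepted.
LIGHTEST_OBSERVED = {"C": 8}

# Elements with a single stable nuclide, where the label adds no information.
MONONUCLIDIC = {"F": 19}

ISOTOPE_ATOM = re.compile(r"\[(\d+)([A-Z][a-z]?)")


def passes_lower_bound(bracket_atom):
    """Check an isotope label such as "[8CH4]"; unlisted elements are not checked."""
    match = ISOTOPE_ATOM.match(bracket_atom)
    if match is None:
        return True  # no isotope specified
    mass, symbol = int(match.group(1)), match.group(2)
    return mass >= LIGHTEST_OBSERVED.get(symbol, 1)


def drop_mononuclidic_labels(smiles):
    """Remove redundant mass numbers, keeping the brackets so valence is unchanged."""
    def strip(match):
        mass, symbol = int(match.group(1)), match.group(2)
        if MONONUCLIDIC.get(symbol) == mass:
            return "[" + symbol
        return match.group(0)

    return ISOTOPE_ATOM.sub(strip, smiles)


print(passes_lower_bound("[7CH4]"), passes_lower_bound("[8CH4]"))
print(drop_mononuclidic_labels("[19F]C"))
```

Keeping the brackets in the output ("[F]C" rather than "FC") avoids changing the hydrogen count of an atom that has no other bonds.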

Disavowed by the government

• There are a number of species PubChem rejects:

– Chlorine dioxide O=[Cl]=O

– Carbide anions: [C-]#[C-] and [C-4]

• But there is hope…

– Disulfur dioxide: O=[S][S]=O → O=S=S=O

Related compounds/substances

• CID → CID

– Same Connectivity, Same Stereochemistry, Same Isotopes

– Same Parent Connectivity, Same Exact Parent

– Mixtures, Components and Neutralized Forms

– Unique Components

– Similar Compounds (90% Tanimoto), Similar Conformers

• CID → SID

– All, Same Structure, Mixture

• SID → SID

– Same Connectivity, Same Exact

• SID → CID

– PubChem SID

PubChem bond encoding

• PubChem allows depositors to specify advanced representations of molecular structures such as inorganics and organometallics via SD tags.

• PUBCHEM_NONSTANDARDBOND

– 4 = Quadruple bond, 5 = Dative bond, 6 = Complex bond, 7 = Ionic bond.

• PUBCHEM_BONDANNOTATIONS

– 2 = Hydrogen bond, 9 = Resonance bond, 10 = Bold bond, 11 = Fischer bond, 12 = Close contact.

• Relatively few depositors make use of these.
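
A sketch of how a reader of such records might decode these tags. The code tables come from the slide; the layout of the tag value (one "atom1 atom2 code" triple per line) and the example value are assumptions and should be checked against PubChem's deposition documentation.

```python
NONSTANDARD_BOND = {4: "quadruple", 5: "dative", 6: "complex", 7: "ionic"}
BOND_ANNOTATION = {
    2: "hydrogen bond",
    9: "resonance",
    10: "bold",
    11: "Fischer",
    12: "close contact",
}


def parse_bond_tag(value, codes):
    """Parse lines of "atom1 atom2 code" into (atom1, atom2, meaning) tuples.

    The three-integer line layout is an assumption about the SD tag value.
    """
    parsed = []
    for line in value.splitlines():
        if not line.strip():
            continue
        first, second, code = (int(field) for field in line.split())
        parsed.append((first, second, codes.get(code, "unknown code %d" % code)))
    return parsed


example = "1 2 5\n1 3 7\n"  # hypothetical tag value, not from a real record
print(parse_bond_tag(example, NONSTANDARD_BOND))
```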

Final thoughts: abstract

For all of the grief that I give Evan, often over corner cases of chemical semantics that only one or two people care about, it is fair to say that PubChem represents the current state-of-the-art in chemical structure representation. Nobody does it better. Under the surface, unseen by most users, are a large number of technical and scientific innovations that have enabled PubChem to scale over the past decade and a half to now contain approaching 100 million compounds. From simple design decisions such as the substance vs. compound distinction [that allows PubChem to avoid the early mistakes of CAS] to breakthroughs such as canonical Kekulé SMILES [to avoid the early mistakes of Daylight Chemical Information Systems], the architecture of PubChem contains a treasure trove of cheminformatics innovations, covering normalization, tautomers, mixtures, 2D fingerprints and similarity, substructure search, biopolymers, text mining and much more. During this presentation I hope to share some of the cool insights that the remarkable staff at the NCBI often forget to mention or are too modest to point out.

Congratulations Evan and Steve.

Acknowledgements

• Evan Bolton, Steve Bryant, Paul Thiessen, Volker Hähnke, David Lipman and the PubChem team at the NCBI.

• John May, at NextMove Software, for the analysis of PubChem atom types affected by Biovia changes.

• The rest of the team at NextMove Software.

• George Vacek and the team at OpenEye Scientific Software.
