InChI for Large Molecules: The NextMove Software Perspective
Roger Sayle & Noel O’Boyle
NextMove Software, Cambridge, UK
InChI for Large Molecules meeting, NCBI, Bethesda, MD, Monday 27th October 2014
“This house believes…”
• The most important distinction in life science informatics is between molecular and non-molecular (bio)chemistry, not between chemistry and biology.
• Fuzzy distinctions such as “small molecules”, lipids, proteins, nucleic acids, peptides, oligosaccharides, or terpenes are like asking how many colors there are in a rainbow (cf. the Sapir-Whorf hypothesis).
• Schemes that encode these distinctions (such as HELM, ISO 11238, and even RasMol) break down when the (poorly defined) categories overlap.
Peptide or not?
cyclo[OAla-Val-D-OVal-D-Val-OAla-Val-D-OVal-D-Val-OAla-Val-D-OVal-D-Val]
valinomycin
Saccharide or not?
• D-Glucopyranose: D-gluco-hexopyranose
• D-Quinovopyranose: 6-deoxy-Glucopyranose; 6-deoxy-D-gluco-hexopyranose
• D-Paratopyranose: 3,6-dideoxy-Glucopyranose; 3,6-dideoxy-D-ribo-hexopyranose
• D-Amicetopyranose: 2,3,6-trideoxy-Glucopyranose; 2,3,6-trideoxy-D-erythro-hexopyranose
• (2S)-2-methyloxane: (2S)-2-methyl-tetrahydropyran
The cutting edge of biosimilarity
• The high prevalence of potentially life-threatening hypersensitivity reactions to the antibody cetuximab (Erbitux) in some US states has been traced to its glycosylation, which contains a Gal(α1-3)Gal epitope.
Chung et al., “Cetuximab-induced anaphylaxis and IgE specific for galactose-alpha-1,3-galactose”, New England Journal of Medicine, Vol. 358, No. 11, pp. 1109-1117, 13th March 2008.
• Similarly, human erythropoietin (EPO) alpha, beta, delta and omega share the same primary sequence but differ in their glycosylation patterns.
Destructive suggestion…
• Systems based upon monomer dictionaries (such as HELM and PDB) are notoriously difficult to maintain.
• The limited number of monomers in proteinogenic peptides and natural nucleic acid sequences creates a false sense of security: the impression that the set of monomers is finite.
• In practice, the number of monomers, post-translational modifications and chemical modifications is infinite.
• Even more difficult than standardizing monomer definitions via a central repository, like PDB, is allowing local custom definitions.
Constructive suggestion…
• Ideally, a chemical identifier should be independent of the input representation or file format.
• Duplicates between small molecules, peptides and proteins are best detected with a single identifier, preferably the existing InChI.
• This is possible because increases in computing power and storage mean that cheminformatics toolkits can handle huge biopolymers on modern hardware.
Proof-of-concept
• I’ve previously reported on Tanimoto chemical search of the PDB (80K) represented as canonical SMILES (1Gb).
• To test for duplicates and InChI key hash collisions, we attempted to generate InChI keys for UniProt.
• The Open Babel source tree already contains patches to the InChI library that raise the official 1024-atom limit.
• A few additional source changes also helped.
• Ultimately, InChI keys could be generated for ~99.4% of the ~450K unique sequences in the Swiss-Prot division.
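The duplicate and hash-collision check described above can be sketched as follows. This is a minimal illustration, not the code used for the UniProt run: the function name `audit_keys`, the record layout and the placeholder identifier strings are all assumptions. Two sequences count as duplicates when their full identifiers match, and a key counts as a collision when it is shared by different full identifiers.

```python
from collections import defaultdict


def audit_keys(records):
    """Group (seq_id, full_identifier, key) records by key.

    Returns (duplicates, collisions):
      duplicates: sorted lists of seq_ids that share one full identifier
      collisions: keys that map to more than one distinct full identifier
    """
    by_key = defaultdict(lambda: defaultdict(list))
    for seq_id, full_identifier, key in records:
        by_key[key][full_identifier].append(seq_id)

    duplicates, collisions = [], []
    for key, by_full in by_key.items():
        for seq_ids in by_full.values():
            if len(seq_ids) > 1:
                duplicates.append(sorted(seq_ids))
        if len(by_full) > 1:
            collisions.append(key)
    return duplicates, collisions


# Hypothetical records: the identifiers and keys are placeholders,
# not real InChI strings or InChI keys.
records = [
    ("SEQ_A", "full-identifier-1", "KEY-ONE"),
    ("SEQ_B", "full-identifier-1", "KEY-ONE"),  # same molecule as SEQ_A
    ("SEQ_C", "full-identifier-2", "KEY-TWO"),
    ("SEQ_D", "full-identifier-3", "KEY-TWO"),  # different molecule, same key
]

duplicates, collisions = audit_keys(records)
print(duplicates)
print(collisions)
```

Keeping the full identifier alongside the hashed key is what lets a true duplicate be told apart from a collision; with keys alone the two cases look identical.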
Record-breaking InChI key
• Sequence Identifier: UTP10_KLULA
• Sequence Length: 1774 amino acids
• Molecule size: 28509 atoms
• InChI Length: 119699 characters
• InChI key: PHBRSEQMAKHFGD-ZBXWIJJNSA-N
• InChI Canonicalization Time: 73.2s
• Canonical SMILES Length: 35408 chars
• SMILES Canonicalization Time: 0.4s
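To put these figures in proportion, the short calculation below derives a few ratios from them. The input numbers are taken from the list above; the variable names are only illustrative, and the rounded values in the comments are approximate.

```python
# Figures reported for UTP10_KLULA.
residues = 1774
atoms = 28509
inchi_chars = 119699
smiles_chars = 35408
inchi_seconds = 73.2
smiles_seconds = 0.4

atoms_per_residue = atoms / residues            # about 16.1
inchi_chars_per_atom = inchi_chars / atoms      # about 4.2
smiles_chars_per_atom = smiles_chars / atoms    # about 1.2
length_ratio = inchi_chars / smiles_chars       # about 3.4
time_ratio = inchi_seconds / smiles_seconds     # about 183

print(f"atoms per residue:     {atoms_per_residue:.1f}")
print(f"InChI chars per atom:  {inchi_chars_per_atom:.1f}")
print(f"SMILES chars per atom: {smiles_chars_per_atom:.1f}")
print(f"InChI/SMILES length:   {length_ratio:.1f}")
print(f"InChI/SMILES time:     {time_ratio:.0f}")
```

On this one protein, the InChI is roughly three times longer than the canonical SMILES, while its canonicalization takes on the order of two hundred times longer.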
Protein canonicalization time
[Three slides with this title; the figures were not recovered.]
Conclusions
• “InChI for large molecules” simply requires fixing the bugs in standard InChI.
Acknowledgements
• Lisa Sach-Peltason, Hoffmann-La Roche, Basel.
• Joann Prescott-Roy, Novartis, Boston, MA.
• Greg Landrum, Novartis, Basel, Switzerland.
• Evan Bolton, NCBI PubChem project, Bethesda, MD.
Competing interests statement
Sugar & Splice
• PDB
• Depictions
• Common name: [5-L-aspartic acid]oxytocin
• IUPAC name: L-cysteinyl-L-tyrosyl-L-isoleucyl-L-glutaminyl-L-alpha-aspartyl-L-cysteinyl-L-prolyl-L-leucyl-glycinamide (1->6)-disulfide
• IUPAC condensed: L-Cys(1)-L-Tyr-L-Ile-L-Gln-L-Asp-L-Cys(1)-L-Pro-L-Leu-Gly-NH2
• SMILES: [C@H]1(CCCN1C(=O)[C@@H]1CSSC[C@@H](C(=O)N[C@@H](Cc2ccc(cc2)O)C(=O)N[C@@H]([C@H](CC)C)C(=O)N[C@@H](CCC(=O)N)C(=O)N[C@@H](CC(=O)O)C(=O)N1)N)C(=O)N[C@@H](CC(C)C)C(=O)NCC(=O)N
• PLN: H-C(1)YIQDC(1)PLG-[NH2]
• HELM: PEPTIDE1{C.Y.I.Q.N.C.P.L.G.[am]}$PEPTIDE1,PEPTIDE1,1:R3-6:R3$$$
Peptide names imply architecture
• Named peptides imply not only sequence but also N-terminal acetylation, C-terminal amidation and disulfide bridge topology.
• Example named derivatives:
– gastrin (14-17)
– motilin amide
– oxytocin free-acid
– acetyl-oxytocin
– deacetyl-abarelix
– oxytocin reduced
– endothelin-1 (1→3),(11→15)-bis(disulfide)
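The three pieces of architecture listed above (terminal groups, sequence, disulfide topology) are all explicit in a line notation such as H-C(1)YIQDC(1)PLG-[NH2]. The sketch below pulls them apart. It assumes a deliberately simplified grammar (one-letter residues, an optional numeric bridge label in parentheses, exactly one hyphen on each side of the sequence) and is not a parser for the full PLN or any other real specification; `parse_peptide_line` is a hypothetical name.

```python
import re

# One residue: an uppercase letter, optionally followed by a bridge label "(n)".
RESIDUE = re.compile(r"([A-Z])(?:\((\d+)\))?")


def parse_peptide_line(text):
    """Split 'Nterm-SEQUENCE-Cterm' into termini, sequence and bridges.

    Assumed simplified grammar; residue positions are 1-based.
    """
    parts = text.split("-")
    if len(parts) != 3:
        raise ValueError("expected exactly N-terminus, sequence and C-terminus")
    n_term, body, c_term = parts
    if not re.fullmatch(r"(?:[A-Z](?:\(\d+\))?)+", body):
        raise ValueError(f"unrecognised sequence: {body!r}")

    sequence = []
    labelled = {}  # bridge label -> list of residue positions
    for position, match in enumerate(RESIDUE.finditer(body), start=1):
        letter, label = match.groups()
        sequence.append(letter)
        if label is not None:
            labelled.setdefault(label, []).append(position)

    bridges = []
    for label, positions in labelled.items():
        if len(positions) != 2:
            raise ValueError(f"bridge label {label} must appear exactly twice")
        bridges.append(tuple(positions))

    return {
        "n_term": n_term,
        "sequence": "".join(sequence),
        "c_term": c_term.strip("[]"),
        "disulfides": sorted(bridges),
    }


parsed = parse_peptide_line("H-C(1)YIQDC(1)PLG-[NH2]")
print(parsed)
```

For this input the parser reports a free N-terminus (H), the sequence CYIQDCPLG, a C-terminal NH2 and one bridge between positions 1 and 6, matching the (1->6)-disulfide in the systematic name.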