Chemical databasing: state of the art and current challenges
Valery Tkachenko
Royal Society of Chemistry
Kazan Summer School on Cheminformatics
Kazan, Russia
July 6th 2015
Why databases?
Efficient storage
Quick access (browse, search)
ACID (Atomicity, Consistency, Isolation, Durability)
Scalability
Migrations
Security
Safety (backup/restore)
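A minimal sketch of the atomicity part of ACID, using Python's built-in sqlite3 module. The `sample` table, its `grams` column and the `transfer_stock` helper are hypothetical names chosen for illustration and are not taken from any RSC system.

```python
import sqlite3

def transfer_stock(conn, src, dst, grams):
    """Move `grams` from sample `src` to sample `dst` as one atomic unit."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute("UPDATE sample SET grams = grams + ? WHERE id = ?", (grams, dst))
            # If this second update violates the CHECK constraint,
            # the first update above is rolled back as well.
            conn.execute("UPDATE sample SET grams = grams - ? WHERE id = ?", (grams, src))
        return True
    except sqlite3.IntegrityError:
        return False

# Hypothetical in-memory table; stock may never become negative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sample (id INTEGER PRIMARY KEY, grams REAL CHECK (grams >= 0))")
conn.executemany("INSERT INTO sample VALUES (?, ?)", [(1, 5.0), (2, 1.0)])
conn.commit()
```

Calling `transfer_stock(conn, 1, 2, 10.0)` should fail and leave both rows unchanged, while a transfer of 2.0 should succeed and update both rows together.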
Database – model and data
Database – relational example
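One possible relational layout, sketched with Python's built-in sqlite3 module. The `compound` and `synonym` tables and the `find_by_synonym` helper are assumptions made for illustration; the schema shown on the original slide is not reproduced here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE compound (
    id      INTEGER PRIMARY KEY,
    name    TEXT NOT NULL,
    formula TEXT NOT NULL
);
CREATE TABLE synonym (
    compound_id INTEGER NOT NULL REFERENCES compound(id),
    synonym     TEXT NOT NULL
);
""")
conn.executemany("INSERT INTO compound VALUES (?, ?, ?)",
                 [(1, "ethanol", "C2H6O"), (2, "water", "H2O")])
conn.executemany("INSERT INTO synonym VALUES (?, ?)",
                 [(1, "ethyl alcohol"), (1, "EtOH"), (2, "oxidane")])
conn.commit()

def find_by_synonym(conn, text):
    """Return (name, formula) rows for compounds that carry the given synonym."""
    cursor = conn.execute(
        "SELECT c.name, c.formula FROM compound c "
        "JOIN synonym s ON s.compound_id = c.id "
        "WHERE s.synonym = ?", (text,))
    return cursor.fetchall()
```

The model (two tables linked by a key) stays fixed while the data (rows) grows; one compound can have any number of synonyms without duplicating its record.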
Chemical database
Chemistry-specific searches
Identity – same atoms connected in the same way
Substructure – find all chemicals having query as a substructure
Superstructure – find all chemicals which are substructures of a query
Similarity – find all “similar” chemicals
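The identity, substructure and superstructure searches above can be sketched on a toy model. This is a simplified illustration, not how a production chemistry engine works: molecules are heavy-atom graphs, bond orders, hydrogens, aromaticity and stereochemistry are ignored, and matching is plain backtracking. All function names and the small `db` dictionary are hypothetical.

```python
def make_mol(atoms, bonds):
    """atoms: list of element symbols; bonds: iterable of (i, j) index pairs."""
    adj = {i: set() for i in range(len(atoms))}
    for i, j in bonds:
        adj[i].add(j)
        adj[j].add(i)
    return {"atoms": list(atoms), "adj": adj}

def is_substructure(query, target):
    """True if every query atom and bond can be mapped onto the target."""
    q_atoms, q_adj = query["atoms"], query["adj"]
    t_atoms, t_adj = target["atoms"], target["adj"]

    def extend(mapping):
        if len(mapping) == len(q_atoms):
            return True
        q = len(mapping)  # next query atom to place
        for t in range(len(t_atoms)):
            if t in mapping.values() or t_atoms[t] != q_atoms[q]:
                continue
            # Every already-mapped neighbour of q must be bonded to t.
            if all(mapping[n] in t_adj[t] for n in q_adj[q] if n in mapping):
                mapping[q] = t
                if extend(mapping):
                    return True
                del mapping[q]
        return False

    return extend({})

def bond_count(mol):
    return sum(len(n) for n in mol["adj"].values()) // 2

def is_identical(a, b):
    """Same atoms connected in the same way (within this toy model)."""
    return (len(a["atoms"]) == len(b["atoms"])
            and bond_count(a) == bond_count(b)
            and is_substructure(a, b))

def substructure_search(query, db):
    """All entries that contain the query."""
    return [name for name, mol in db.items() if is_substructure(query, mol)]

def superstructure_search(query, db):
    """All entries that are contained in the query."""
    return [name for name, mol in db.items() if is_substructure(mol, query)]

# Hypothetical mini database of heavy-atom skeletons.
db = {
    "methanol": make_mol(["C", "O"], [(0, 1)]),
    "ethanol": make_mol(["C", "C", "O"], [(0, 1), (1, 2)]),
    "propane": make_mol(["C", "C", "C"], [(0, 1), (1, 2)]),
    "acetic acid": make_mol(["C", "C", "O", "O"], [(0, 1), (1, 2), (1, 3)]),
}
```

With a C-O query, `substructure_search` returns every entry except propane; with ethanol as the query, `superstructure_search` returns methanol and ethanol.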
InChI (http://www.inchi-trust.org/)
Pidolic acid
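An InChI string is a sequence of layers separated by "/". The sketch below splits a standard InChI into its leading layers; it uses ethanol rather than pidolic acid to keep the string short. `split_inchi` is a hypothetical helper, not part of the official InChI software, and it does not handle repeated layer prefixes (for example in fixed-hydrogen or reconnected layers).

```python
LAYER_NAMES = {
    "c": "connections",
    "h": "hydrogens",
    "q": "charge",
    "p": "protons",
    "b": "double_bond_stereo",
    "t": "tetrahedral_stereo",
    "i": "isotopes",
}

def split_inchi(inchi):
    """Split an InChI string into a dict of named layers (simplified)."""
    prefix, _, body = inchi.partition("=")
    parts = body.split("/")
    if prefix != "InChI" or len(parts) < 2:
        raise ValueError("not an InChI string")
    layers = {"version": parts[0], "formula": parts[1]}
    for layer in parts[2:]:
        if layer:  # skip empty segments
            # Unknown prefixes are kept under their one-letter key.
            layers[LAYER_NAMES.get(layer[0], layer[0])] = layer[1:]
    return layers

ethanol = split_inchi("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3")
```

Here `ethanol["formula"]` is `"C2H6O"` and `ethanol["connections"]` is `"1-2-3"`. Because the layers are canonical, two databases that generate the same InChI for a record can treat the records as the same compound.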
Fingerprints
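A rough sketch of how a hashed fingerprint supports similarity search. Structural features (given here as plain strings, which is an assumption for illustration) are hashed into a fixed-width bit vector, and two vectors are compared with the Tanimoto coefficient. Real fingerprint schemes derive features from paths or atom environments and use far more bits; `fingerprint` and `tanimoto` are hypothetical helper names.

```python
import zlib

def fingerprint(features, n_bits=64):
    """Fold a collection of feature strings into an n_bits-wide integer bit vector."""
    bits = 0
    for feature in features:
        # crc32 is used because it is deterministic across runs,
        # unlike Python's built-in hash() for strings.
        bits |= 1 << (zlib.crc32(feature.encode("utf-8")) % n_bits)
    return bits

def tanimoto(a, b):
    """Tanimoto coefficient: shared on-bits divided by all on-bits."""
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return bin(a & b).count("1") / union

# Hypothetical feature sets for two molecules that share one feature.
fp_a = fingerprint(["C-C", "C-O"])
fp_b = fingerprint(["C-C", "C-N"])
score = tanimoto(fp_a, fp_b)
```

A similarity search then ranks database entries by this score against the query fingerprint and keeps those above a chosen threshold.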
Human Molecule
SciFinder
Reaxys
PubChem
ChemSpider
• 32 million chemicals and growing
• Data sourced from >500 different sources
• Crowdsourced curation and annotation
• Ongoing deposition of data from our journals and our collaborators
• A structure centric hub for web-searching
Properties - experimental
Properties - ACD/Labs
Properties - EPI Suite
Properties - ChemAxon
Literature references
Patent references
Books
Classification
Chemical vendors and data sources
Multimedia
Dimensions and complexity of science
Chemical space - 10^60
RSC Archive – since 1841
Digitally Enabling RSC Archive
Advanced Search
It is so difficult to navigate…
• What’s the structure?
• Are they in our file?
• What’s similar?
• What’s the target?
• Pharmacology data?
• Known pathways?
• Working on now?
• Connections to disease?
• Expressed in right cell type?
• Competitors?
• IP?
ChemSpider Synthetic Pages
Compounds
Reaction
Analytical Data
Text and References
Electronic Laboratory Notebook (ELN)
RSC Data Repository
Data Repository
• Properties
• Names and Identifiers
• Spectra
• Articles
• Data
• Collections
• Patents
• Etc
Input pipeline
Output pipeline
RSC Databases
RSC Compounds
RSC Reactions
RSC Spectra
RSC Crystals
RSC Polymers
RSC Materials
RSC Assays
RSC Algorithms
RSC Models
…and on…
Compounds domain
Data quality issue and CVSP
– Robochemistry
– Proliferation of errors in public and private databases
• ChemSpider
• PubChem
• DrugBank
• KEGG
• ChEBI/ChEMBL
– Automated quality control system
Chemistry Validation and Standardization Platform
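To illustrate what an automated, rule-based quality check can look like, here is a minimal sketch that flags atoms with too many bonds. It is not CVSP's actual rule set or API: the `MAX_VALENCE` table, the record layout and the `validate` function are all assumptions made for this example, and charged atoms are not handled.

```python
# Illustrative limits for neutral atoms only (assumption, not a CVSP table).
MAX_VALENCE = {"H": 1, "C": 4, "N": 3, "O": 2}

def validate(record):
    """Return a list of human-readable issues for one structure record.

    record["atoms"] is a list of element symbols;
    record["bonds"] is a list of (i, j, order) tuples.
    """
    atoms = record["atoms"]
    if not atoms:
        return ["empty structure"]

    # Sum the bond orders on every atom.
    used = [0] * len(atoms)
    for i, j, order in record["bonds"]:
        used[i] += order
        used[j] += order

    issues = []
    for idx, (symbol, valence) in enumerate(zip(atoms, used)):
        limit = MAX_VALENCE.get(symbol)
        if limit is None:
            issues.append(f"atom {idx}: unknown element {symbol!r}")
        elif valence > limit:
            issues.append(f"atom {idx}: {symbol} has valence {valence} > {limit}")
    return issues

# Hypothetical records: a valid C-O skeleton and a five-bonded carbon.
good = {"atoms": ["C", "O"], "bonds": [(0, 1, 1)]}
bad = {"atoms": ["C", "H", "H", "H", "H", "H"],
       "bonds": [(0, k, 1) for k in range(1, 6)]}
```

Running such rules over every incoming record, and reporting issues instead of silently accepting the structure, is one way to stop errors from propagating between databases.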
Reactions domain
• ChemSpider Synthetic Pages
• Methods in Organic Synthesis
• Catalysts and Catalyzed Reactions
• USPTO
Analytical data domain
Crystallography domain
3D printable structures
We are a part of a larger world
Who is involved?
29 partners
Research questions
OpenPHACTS Architecture
OpenPHACTS UI
http://explorer.openphacts.org/
National Chemistry Database