Chemistry Validation and Standardization Platform
Modularization and “Hadoop”ization
Kenneth Karapetyan, Colin Batchelor, Valery Tkachenko, Antony Williams
ACS New Orleans, April 2013
Overview
• Motivation
• What we support
• Modularization
• Parallelization
• Examples
Motivation: validation
Open and free chemical validation system for:
• Structure validation
– Warn on query atoms, pseudo atoms, polymers, etc.
– Nonsensical stereo
• SDF field mapping for validating depositor-provided names, InChI, SMILES
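As an illustration of the SDF field mapping step, the sketch below reads the data items that follow each molblock in an SD file and maps depositor field names onto the identifier types to validate. It is not CVSP code: the function names, the field names and the mapping are hypothetical, and only the SD file layout itself ("> <FIELD>" headers, blank-line terminators, "$$$$" record separators) is taken from the file format.

```python
import re

# Matches an SD data header such as "> <INCHI>" and captures the field name.
_HEADER = re.compile(r"^>.*?<([^<>]+)>")


def parse_sdf_fields(text):
    """Yield one {field_name: value} dict per record of an SD file."""
    fields, name, in_data = {}, None, False
    for line in text.splitlines():
        if line.strip() == "$$$$":            # end of record
            yield {k: "\n".join(v) for k, v in fields.items()}
            fields, name, in_data = {}, None, False
        elif line.startswith("M  END"):       # molblock finished, data items follow
            in_data = True
        elif in_data:
            match = _HEADER.match(line)
            if match:
                name = match.group(1)
                fields[name] = []
            elif name is not None:
                if line.strip():
                    fields[name].append(line)
                else:                         # blank line closes the data item
                    name = None


def map_fields(fields, field_map):
    """Rename depositor fields to identifier types (hypothetical mapping)."""
    return {role: fields[src] for src, role in field_map.items() if src in fields}
```

Each mapped value (name, InChI, SMILES) could then be compared with the identifier generated from the structure itself, which is the kind of check that produces the "not matching structure" warnings.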
Motivation: standardization
Allows users to run the default CVSP standardization workflow (or the FDA, Open PHACTS and other workflows)
Allows users to put together their own workflow from the modules provided:
• Apply default CVSP or user-defined SMIRKS rules
• Layout
• Neutralize
• Get canonical tautomer using ChemAxon’s algorithms
• Get biggest organic fragment
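The idea of assembling a workflow from named modules can be sketched as a simple function pipeline. Everything here is hypothetical: the registry, the module names and the two toy steps are placeholders, and the "biggest fragment" stand-in only picks the longest dot-separated SMILES component rather than doing any real chemistry.

```python
def strip_whitespace(smiles):
    """Toy module: trim surrounding whitespace from a SMILES string."""
    return smiles.strip()


def biggest_fragment(smiles):
    """Toy module: keep the longest dot-separated SMILES component.

    A real implementation would count atoms and check for carbon;
    string length is only a crude placeholder.
    """
    return max(smiles.split("."), key=len)


# Hypothetical registry of available modules, keyed by name.
MODULES = {
    "strip": strip_whitespace,
    "biggest_fragment": biggest_fragment,
}


def build_workflow(module_names):
    """Return a function that applies the named modules in order."""
    steps = [MODULES[name] for name in module_names]

    def run(record):
        for step in steps:
            record = step(record)
        return record

    return run
```

A user-defined workflow is then just an ordered list of module names, for example `build_workflow(["strip", "biggest_fragment"])`.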
What we support
• SD files and mol files
• ChemDraw files (in-house code)
• Tab-delimited text files of names, InChIs, SMILES
• Zipped files
• GZipped files
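One way to accept plain, zipped and gzipped uploads through a single entry point is to inspect the bytes before parsing. The sketch below uses only the Python standard library; the function name and the placeholder member names are assumptions and do not reflect how CVSP handles uploads.

```python
import gzip
import io
import zipfile


def read_payloads(data):
    """Yield (name, content_bytes) for each file in a possibly compressed upload."""
    if data[:2] == b"\x1f\x8b":                      # gzip magic number
        yield ("<gzip>", gzip.decompress(data))
    elif zipfile.is_zipfile(io.BytesIO(data)):
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            for info in archive.infolist():
                if not info.is_dir():                # skip directory entries
                    yield (info.filename, archive.read(info))
    else:                                            # treat as an uncompressed file
        yield ("<plain>", data)
```

Each yielded payload can then be handed to the appropriate parser (SD, mol, ChemDraw or tab-delimited text) based on its name or content.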
CVSP: modularization
Reusable workflows
SMIRKS-based rules
“Hadoop”ization
Apache Hadoop is a framework for the distributed processing of large data sets across clusters of computers.
CVSP is written in C#. To run it on Linux machines we use Mono (cross-platform .NET runtime environment)
Farm:
• 28 CPU cores
• 42 GB memory
• 2 TB disk space
Processor-intensive tasks:
• Tautomerization
Input file
Deposit ID in database
Upload to farm for processing on Hadoop
Hadoop processing
Download results
Upload results to database for user preview
Convert to SD format
Hadoop queues
Three Hadoop queues (capacity queues) are used to prioritize CVSP submissions by size:
• “Small” submission queue for submissions under 500 records
• Large submissions queue
• Internal queue
– For internal projects, e.g. tautomer analysis of ChemSpider or ChemSpider standardization
All records have to be processed on Hadoop before the user can see the results (no partial preview)
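The routing rule behind the three queues can be sketched as a small function. The 500-record threshold comes from the slide; the queue names and the `internal` flag are hypothetical labels, not the names configured on the actual cluster.

```python
SMALL_LIMIT = 500  # from the slide: "small" means under 500 records


def choose_queue(record_count, internal=False):
    """Pick a queue name for a submission (queue names are hypothetical)."""
    if internal:
        return "internal"
    return "small" if record_count < SMALL_LIMIT else "large"
```

Keeping small submissions in their own queue means a short job does not have to wait behind a large deposition or a long-running internal project.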
Examples
DrugBank
• ~6500 records, approximately 2 records per second
PubMed
• ~100 000 records, about 9 h
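To put the two figures on the same footing, the helper functions below (hypothetical names) convert between rate and duration. With the rounded numbers from the slide, DrugBank at 2 records per second works out to roughly 54 minutes, and PubMed at 9 h for about 100 000 records corresponds to roughly 3.1 records per second.

```python
def duration_hours(records, rate_per_second):
    """Hours needed to process `records` at a constant rate."""
    return records / rate_per_second / 3600


def rate_per_second(records, hours):
    """Average records per second for a run of `hours`."""
    return records / (hours * 3600)


drugbank_minutes = duration_hours(6500, 2) * 60   # about 54 minutes
pubmed_rate = rate_per_second(100_000, 9)         # about 3.1 records per second
```

These are averages over whole submissions; a single slow record can dominate the wall-clock time of its task.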
Rate-limiting step?
Canonical tautomerization
This molecule took 45 min to canonicalize.
DrugBank dataset (6516 records)
Errors
• 2 records with query (any) bond
• 2 records with R groups
• 3 polymers
• 18 porphyrins with metal coordinated inside with one of the metal-nitrogen bonds stereogenic
• Unusual valence: ~20
Warnings
• InChI not matching structure (100+)
• SMILES not matching structure (100+)
DrugBank ID: DB00755
InChI=1S/C20H28O2/c1-15(8-6-9-16(2)14-19(21)22)11-12-18-17(3)10-7-13-20(18,4)5/h6,8-9,11-12,14H,7,10,13H2,1-5H3,(H,21,22)/b9-6+,12-11+,15-8+,16-14+
DrugBank ID: DB00614
Stereo issues
DB08128 DB06287
J. Brecher, Pure Appl. Chem., 2008, doi:10.1351/pac200880020277