The RSC chemical validation and standardization platform, a potential path to quality-conscious...


Chemistry Validation and Standardization Platform

Modularization and “Hadoop”ization

Kenneth Karapetyan, Colin Batchelor, Valery Tkachenko, Antony Williams

ACS New Orleans, April 2013

Overview

• Motivation
• What we support
• Modularization
• Parallelization
• Examples

Motivation: validation

Open and free chemical validation system for:

• Structure validation
– Warn on query atoms, pseudo atoms, polymers, etc.
– Nonsensical stereo
• SDF field mapping for validating depositor-provided names, InChI, SMILES
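The SDF field mapping idea can be illustrated with a minimal Python sketch. It assumes the standard SD file data-item layout (a `> <FIELD>` header line followed by value lines and a blank line). The function names, the depositor field names, and the role names are hypothetical and are not taken from CVSP itself.

```python
import re

# Header line of an SD data item, e.g. "> <GENERIC_NAME>" or "> 25 <SMILES>".
FIELD_HEADER = re.compile(r"^>.*<([^>]+)>")


def parse_sd_fields(record: str) -> dict[str, str]:
    """Collect the data items of one SD record into a dict."""
    fields: dict[str, str] = {}
    lines = record.splitlines()
    i = 0
    while i < len(lines):
        match = FIELD_HEADER.match(lines[i])
        if match:
            i += 1
            value = []
            # A data item's value runs until the next blank line.
            while i < len(lines) and lines[i].strip():
                value.append(lines[i])
                i += 1
            fields[match.group(1)] = "\n".join(value)
        else:
            i += 1
    return fields


def map_fields(fields: dict[str, str], mapping: dict[str, str]) -> dict[str, str]:
    """Rename depositor fields to validation roles (name, inchi, smiles)."""
    return {role: fields[src] for src, role in mapping.items() if src in fields}


# Hypothetical depositor field names mapped onto the roles to validate.
EXAMPLE_MAPPING = {"GENERIC_NAME": "name", "INCHI_ID": "inchi", "SMILES": "smiles"}
```

Once the fields are mapped to roles, each depositor-provided value can be compared with the value computed from the structure.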

Motivation: standardization

Allows users to use the default CVSP standardization workflow (or others such as FDA, Open PHACTS and so on)

Allows users to put together their own workflow using modules provided:
• Apply default CVSP or user-defined SMIRKS rules
• Layout
• Neutralize
• Get canonical tautomer using ChemAxon’s algorithms
• Get biggest organic fragment
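As a rough illustration of assembling a workflow from modules, the Python sketch below chains step functions over a SMILES string. It is not CVSP code: the step names are hypothetical, and the fragment picker is a crude string-level stand-in that counts atom tokens and does not check whether a fragment is organic.

```python
import re
from typing import Callable

# A workflow step takes a SMILES string and returns a SMILES string.
Step = Callable[[str], str]

# Crude atom tokenizer: bracket atoms first, then organic-subset symbols.
ATOM_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]")


def strip_whitespace(smiles: str) -> str:
    """Toy step: remove surrounding whitespace."""
    return smiles.strip()


def biggest_fragment(smiles: str) -> str:
    """Toy step: keep the dot-separated component with the most atom tokens."""
    return max(smiles.split("."), key=lambda frag: len(ATOM_TOKEN.findall(frag)))


def run_workflow(smiles: str, steps: list[Step]) -> str:
    """Apply the steps in the order the user listed them."""
    for step in steps:
        smiles = step(smiles)
    return smiles


# A user-defined workflow is just an ordered list of steps.
my_workflow = [strip_whitespace, biggest_fragment]
```

Because each step has the same signature, a default workflow and a user-defined one differ only in which steps are listed and in what order.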

What we support

• SD files and mol files
• ChemDraw files (in-house code)
• Tab-delimited text files of names, InChIs, SMILES
• Zipped files
• GZipped files
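Handling plain, zipped and gzipped uploads could look roughly like the Python sketch below. It is an assumption-laden illustration, not the CVSP implementation: the function names are hypothetical, only the first member of a zip archive is read, and UTF-8 text is assumed.

```python
import gzip
import io
import zipfile


def load_text(data: bytes) -> str:
    """Return the text of an upload that may be plain, gzipped or zipped."""
    if data[:2] == b"\x1f\x8b":  # gzip magic number
        data = gzip.decompress(data)
    elif zipfile.is_zipfile(io.BytesIO(data)):
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            # Simplification: read only the first file in the archive.
            data = archive.read(archive.namelist()[0])
    return data.decode("utf-8")


def split_sd_records(text: str) -> list[str]:
    """Split SD file text into records at the '$$$$' delimiter lines."""
    records, current = [], []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            records.append("\n".join(current))
            current = []
        else:
            current.append(line)
    # Keep a trailing record that lacks a final delimiter (e.g. a mol file).
    if any(line.strip() for line in current):
        records.append("\n".join(current))
    return records
```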

CVSP: modularization

Reusable workflows

SMIRKS-based rules

“Hadoop”ization

Apache Hadoop is a framework for the distributed processing of large data sets across clusters of computers.

CVSP is written in C#. To run it on Linux machines we use Mono (cross-platform .NET runtime environment)

Farm:
• 28 CPU cores
• 42 GB memory
• 2 TB disk space

Processor-intensive tasks:
• Tautomerization

• Input file
• Deposit ID in database
• Upload to farm for processing on Hadoop
• Hadoop processing
• Download results
• Upload results to database for user preview
• Convert to SD format

Hadoop queues

Three Hadoop queues (capacity queues) are used to prioritize CVSP submissions by size

• “Small” submission queue for submissions under 500 records
• Large submissions queue
• Internal queue
– For internal projects, e.g. tautomer analysis of ChemSpider or ChemSpider standardization

All records have to be processed on Hadoop before the user can see the results (no partial preview)
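The size-based routing described above can be sketched in a few lines of Python. The 500-record threshold comes from the slide. The queue names and the function are hypothetical, and how CVSP actually submits jobs to Hadoop is not shown here.

```python
SMALL_SUBMISSION_LIMIT = 500  # "small" means under 500 records


def choose_queue(record_count: int, internal: bool = False) -> str:
    """Pick a queue name (hypothetical names) for a submission."""
    if internal:
        # Internal projects get their own queue regardless of size.
        return "internal"
    if record_count < SMALL_SUBMISSION_LIMIT:
        return "small"
    return "large"
```

Keeping small submissions in their own queue means they are not stuck behind a large deposition that is already running.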

Examples

DrugBank
• ~6500 records, approximately 2 records per second

PubMed
• ~100 000 records, about 9 h
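To put the two examples on a common footing, the short Python calculation below converts the quoted figures into a total run time and an average rate. It uses only the approximate numbers given above, so the results are rough estimates.

```python
# DrugBank: ~6500 records at roughly 2 records per second.
drugbank_records = 6500
drugbank_rate = 2.0  # records per second
drugbank_minutes = drugbank_records / drugbank_rate / 60  # about 54 minutes

# PubMed: ~100 000 records in about 9 hours.
pubmed_records = 100_000
pubmed_seconds = 9 * 3600
pubmed_rate = pubmed_records / pubmed_seconds  # about 3.1 records per second

print(f"DrugBank: about {drugbank_minutes:.0f} min in total")
print(f"PubMed: about {pubmed_rate:.1f} records per second on average")
```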

Rate-limiting step?

Canonical tautomerization

This molecule took 45 min to canonicalize.

DrugBank dataset (6516 records)

Errors

• 2 records with query (any) bond
• 2 records with R groups
• 3 polymers
• 18 porphyrins with metal coordinated inside with one of the metal-nitrogen bonds stereogenic
• Unusual valence: ~20
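Errors such as query bonds and R groups can be detected directly from a V2000 molfile. The Python sketch below is a hypothetical check, not CVSP's own: it assumes the fixed-column V2000 layout, in which bond type 8 means "any" and `R#`, `A`, `Q`, `*` and `L` are special atom symbols.

```python
# Special atom symbols in the molfile format: R-group label, any atom,
# unspecified atom (Q and *), and atom list.
SPECIAL_ATOM_SYMBOLS = {"R#", "A", "Q", "*", "L"}
ANY_BOND_TYPE = 8  # V2000 query bond type "any"


def find_query_features(molblock: str) -> list[str]:
    """Report special atoms and 'any' bonds found in a V2000 molfile."""
    lines = molblock.splitlines()
    counts = lines[3]  # counts line follows the three header lines
    n_atoms = int(counts[0:3])
    n_bonds = int(counts[3:6])
    issues = []
    atom_lines = lines[4:4 + n_atoms]
    for index, line in enumerate(atom_lines, start=1):
        symbol = line[31:34].strip()  # atom symbol columns
        if symbol in SPECIAL_ATOM_SYMBOLS:
            issues.append(f"atom {index}: special atom symbol {symbol}")
    bond_lines = lines[4 + n_atoms:4 + n_atoms + n_bonds]
    for line in bond_lines:
        first, second, bond_type = int(line[0:3]), int(line[3:6]), int(line[6:9])
        if bond_type == ANY_BOND_TYPE:
            issues.append(f"bond {first}-{second}: query bond type 8 (any)")
    return issues
```

A record that returns a non-empty list would be flagged, since a query atom or bond does not describe a single concrete structure.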

Warnings

• InChI not matching structure (100+)
• SMILES not matching structure (100+)

DrugBank ID: DB00755
InChI=1S/C20H28O2/c1-15(8-6-9-16(2)14-19(21)22)11-12-18-17(3)10-7-13-20(18,4)5/h6,8-9,11-12,14H,7,10,13H2,1-5H3,(H,21,22)/b9-6+,12-11+,15-8+,16-14+

DrugBank ID: DB00614

Stereo issues

DB08128 DB06287

J. Brecher, Pure Appl. Chem., 2008, doi:10.1351/pac200880020277

Thank you

E-mail: karapetyank@rsc.org, batchelorc@rsc.org

Please try CVSP at

http://cv.beta.rsc-us.org
