
The RSC chemical validation and standardization platform, a potential path to quality-conscious...

Date post: 10-Jul-2015
Upload: karen-karapetyan
Chemistry Validation and Standardization Platform Modularization and “Hadoop”ization Kenneth Karapetyan, Colin Batchelor, Valery Tkachenko, Antony Williams ACS New Orleans April 2013
Page 1: The RSC chemical validation and standardization platform, a potential path to quality-conscious databases

Chemistry Validation and Standardization Platform

Modularization and “Hadoop”ization

Kenneth Karapetyan, Colin Batchelor, Valery Tkachenko, Antony Williams

ACS New Orleans, April 2013

Page 2: The RSC chemical validation and standardization platform, a potential path to quality-conscious databases

Overview

• Motivation
• What we support
• Modularization
• Parallelization
• Examples

Page 3: The RSC chemical validation and standardization platform, a potential path to quality-conscious databases

Motivation: validation

Open and free chemical validation system for:

• Structure validation
– Warn on query atoms, pseudo atoms, polymers, etc.
– Nonsensical stereo
• SDF field mapping for validating depositor-provided names, InChI, SMILES
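
SDF field mapping can be illustrated with a minimal sketch. It assumes the standard SD file layout (records terminated by `$$$$`, data items introduced by `> <FIELD>` header lines) and shows how depositor field names might be mapped to canonical ones before validation. The field names in `FIELD_MAP` and the helper names are hypothetical and are not taken from CVSP.

```python
from typing import Dict, Iterator

# Hypothetical mapping from depositor field names to canonical names.
FIELD_MAP = {
    "Compound Name": "name",
    "STD_INCHI": "inchi",
    "Smiles": "smiles",
}


def iter_sdf_records(text: str) -> Iterator[Dict[str, str]]:
    """Yield the data fields of each SD record as a dict.

    Records that carry no data fields are skipped.
    """
    for block in text.split("$$$$"):
        lines = block.splitlines()
        fields: Dict[str, str] = {}
        i = 0
        while i < len(lines):
            line = lines[i]
            # A data header looks like "> <FIELD NAME>".
            if line.startswith(">") and "<" in line and ">" in line[1:]:
                name = line[line.index("<") + 1 : line.rindex(">")]
                i += 1
                value = []
                # The value runs until the next blank line.
                while i < len(lines) and lines[i].strip():
                    value.append(lines[i])
                    i += 1
                fields[name] = "\n".join(value)
            else:
                i += 1
        if fields:
            yield fields


def map_fields(fields: Dict[str, str], field_map: Dict[str, str] = FIELD_MAP) -> Dict[str, str]:
    """Keep only mapped fields, renamed to their canonical names."""
    return {field_map[k]: v for k, v in fields.items() if k in field_map}


SAMPLE = """ethanol
  hypothetical

  0  0  0  0  0  0  0  0  0  0999 V2000
M  END
> <Compound Name>
ethanol

> <Smiles>
CCO

$$$$
"""

records = [map_fields(f) for f in iter_sdf_records(SAMPLE)]
print(records)  # [{'name': 'ethanol', 'smiles': 'CCO'}]
```

Checking that the mapped name, InChI, or SMILES actually matches the drawn structure needs a cheminformatics toolkit and is outside this sketch.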

Page 4: The RSC chemical validation and standardization platform, a potential path to quality-conscious databases

Motivation: standardization

Allows users to run the default CVSP standardization workflow (or the FDA, Open PHACTS, and other workflows)

Allows users to put together their own workflow from the modules provided:
• Apply default CVSP or user-defined SMIRKS rules
• Layout
• Neutralize
• Get canonical tautomer using ChemAxon’s algorithms
• Get biggest organic fragment
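
The idea of assembling a workflow from modules can be sketched as a list of functions applied in order. This is only an illustration of the composition pattern: the module implementations are toy stand-ins that operate on SMILES strings (string length is used as a crude proxy for fragment size, and nothing checks that a fragment is organic), and none of the names come from CVSP.

```python
from typing import Callable, List

# Each module maps a SMILES string to a SMILES string (an assumption of this sketch).
Module = Callable[[str], str]


def strip_whitespace(smiles: str) -> str:
    """Toy clean-up step: drop surrounding whitespace."""
    return smiles.strip()


def biggest_fragment(smiles: str) -> str:
    """Toy stand-in: keep the longest dot-separated SMILES component."""
    return max(smiles.split("."), key=len)


def run_workflow(smiles: str, modules: List[Module]) -> str:
    """Apply each module in order and return the final structure."""
    for module in modules:
        smiles = module(smiles)
    return smiles


# A user-defined workflow is just a different list of modules.
DEFAULT_WORKFLOW: List[Module] = [strip_whitespace, biggest_fragment]

result = run_workflow(" CC(=O)[O-].[Na+] ", DEFAULT_WORKFLOW)
print(result)  # CC(=O)[O-]
```

Real modules such as SMIRKS rule application, neutralization, or tautomer canonicalization would replace the toy functions while the composition stays the same.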

Page 5: The RSC chemical validation and standardization platform, a potential path to quality-conscious databases

What we support

• SD files and mol files
• ChemDraw files (in-house code)
• Tab-delimited text files of names, InChIs, SMILES
• Zipped files
• GZipped files
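
Accepting plain, zipped, and gzipped uploads through one entry point can be sketched as below. The function name and return shape are assumptions made for illustration; the sketch only unpacks the bytes and leaves parsing of the SD, mol, or tab-delimited content to later steps.

```python
import gzip
import io
import zipfile
from typing import List, Tuple


def read_payloads(data: bytes) -> List[Tuple[str, bytes]]:
    """Return (member name, content) pairs for an uploaded file.

    Plain and gzipped uploads yield a single pair with an empty name.
    """
    if data[:2] == b"\x1f\x8b":  # gzip magic number
        return [("", gzip.decompress(data))]
    if zipfile.is_zipfile(io.BytesIO(data)):
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            # Skip directory entries, keep every file in the archive.
            return [(n, zf.read(n)) for n in zf.namelist() if not n.endswith("/")]
    return [("", data)]


raw = b"CCO\tethanol\n"
print(read_payloads(gzip.compress(raw)))  # [('', b'CCO\tethanol\n')]
```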

Page 6: The RSC chemical validation and standardization platform, a potential path to quality-conscious databases

CVSP: modularization

Page 7: The RSC chemical validation and standardization platform, a potential path to quality-conscious databases

Reusable workflows

Page 8: The RSC chemical validation and standardization platform, a potential path to quality-conscious databases

SMIRKS-based rules

Page 9: The RSC chemical validation and standardization platform, a potential path to quality-conscious databases
Page 10: The RSC chemical validation and standardization platform, a potential path to quality-conscious databases
Page 11: The RSC chemical validation and standardization platform, a potential path to quality-conscious databases
Page 12: The RSC chemical validation and standardization platform, a potential path to quality-conscious databases

“Hadoop”ization

Apache Hadoop is a framework for the distributed processing of large data sets across clusters of computers.

CVSP is written in C#. To run it on Linux machines we use Mono (a cross-platform .NET runtime environment).

Farm:
• 28 CPU cores
• 42 GB memory
• 2 TB disk space

Processor-intensive tasks:
• Tautomerization

Page 13: The RSC chemical validation and standardization platform, a potential path to quality-conscious databases

Processing workflow (flow diagram; boxes listed in extracted order):
• Input file
• Deposit ID in database
• Upload to farm for processing on Hadoop
• Hadoop processing
• Download results
• Upload results to database for user preview
• Convert to SD format

Page 14: The RSC chemical validation and standardization platform, a potential path to quality-conscious databases

Hadoop queues

Three Hadoop queues (capacity queues) are used to prioritize large CVSP submissions:
• “Small” submission queue for submissions under 500 records
• Large submissions queue
• Internal queue
– For internal projects, e.g. tautomer analysis of ChemSpider or ChemSpider standardization

All records must be processed on Hadoop before the user can see the results (no partial preview).
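
The routing rule above can be written as a small function. The 500-record threshold comes from the slide; the queue names and the `internal` flag are hypothetical labels chosen for this sketch, not actual CVSP or Hadoop configuration values.

```python
SMALL_LIMIT = 500  # from the slide: "small" means under 500 records


def choose_queue(record_count: int, internal: bool = False) -> str:
    """Pick a queue label for a submission (labels are illustrative)."""
    if internal:
        return "internal"
    return "small" if record_count < SMALL_LIMIT else "large"


print(choose_queue(120))    # small
print(choose_queue(6516))   # large
print(choose_queue(50, internal=True))  # internal
```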

Page 15: The RSC chemical validation and standardization platform, a potential path to quality-conscious databases

Examples

DrugBank
• ~6500 records, approximately 2 records per second

PubMed
• ~100 000 records, about 9 h
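
To put the two figures on the same footing, the quoted numbers can be converted into a total time for DrugBank and a rate for the second dataset. This is plain arithmetic on the approximate values from the slide, so the results are rough estimates.

```python
# DrugBank: ~6500 records at ~2 records per second.
drugbank_records = 6500
drugbank_rate = 2.0  # records per second
drugbank_minutes = drugbank_records / drugbank_rate / 60

# Second dataset: ~100 000 records in ~9 hours.
other_records = 100_000
other_hours = 9
other_rate = other_records / (other_hours * 3600)  # records per second

print(round(drugbank_minutes))  # 54
print(round(other_rate, 1))     # 3.1
```

On these numbers DrugBank takes a little under an hour, and the larger dataset runs at roughly 3 records per second.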

Page 16: The RSC chemical validation and standardization platform, a potential path to quality-conscious databases

Rate-limiting step?

Canonical tautomerization

This molecule took 45 min to canonicalize.

Page 17: The RSC chemical validation and standardization platform, a potential path to quality-conscious databases

DrugBank dataset (6516 records)

Errors

• 2 records with query (any) bond
• 2 records with R groups
• 3 polymers
• 18 porphyrins with metal coordinated inside with one of the metal-nitrogen bonds stereogenic
• Unusual valence: ~20

Warnings

• InChI not matching structure (100+)
• SMILES not matching structure (100+)

Page 18: The RSC chemical validation and standardization platform, a potential path to quality-conscious databases

DrugBank ID: DB00755

InChI=1S/C20H28O2/c1-15(8-6-9-16(2)14-19(21)22)11-12-18-17(3)10-7-13-20(18,4)5/h6,8-9,11-12,14H,7,10,13H2,1-5H3,(H,21,22)/b9-6+,12-11+,15-8+,16-14+

DrugBank ID: DB00614

Page 19: The RSC chemical validation and standardization platform, a potential path to quality-conscious databases

Stereo issues

DB08128 DB06287

J. Brecher, Pure Appl. Chem., 2008, doi:10.1351/pac200880020277

Page 20: The RSC chemical validation and standardization platform, a potential path to quality-conscious databases

Thank you

E-mail: [email protected], [email protected]

Please try CVSP at http://cv.beta.rsc-us.org

