Sourcing high quality online data resources for computational toxicology

Post on 11-May-2015

description

The internet continues to offer increased access to chemistry data that may be of value both to scientists populating systems with reference toxicology data and to those developing predictive models. This presentation gives an overview of some of the data sources available via the internet, outlines some of the challenges of gathering high-quality data, and discusses methods for meshing together disparate data sources.

transcript

Sourcing High-Quality Online Data Resources for Computational Toxicology

Antony Williams

Bio-IT World, Current Methods for Computational Toxicology and Chemogenomics

The Community Depends On Us

“We don’t want another Love Canal!” “What we know about PCBs should warn us all!”

The public is “suspicious” of pharma… “Chemicals are dangerous”

Comp Tox Models Depend on DATA

Models for Computational Toxicology depend on the quality of the training set

There are multiple issues with data quality, including:

Experimental: the validity of the method, reproducibility, sample quality, data capture, transcription of values

Computational: accurately representing the data (correct units, annotations, quality flags, attribution) and whether the structures are correct
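The computational side of the list above (units, annotations, quality flags, attribution) can be made concrete with a small record type. The following is only a minimal sketch: the class name, field names and the set of allowed flags are hypothetical illustrations, not a schema used by any of the databases discussed in this talk, and the numeric value is a placeholder rather than a real measurement.

```python
from dataclasses import dataclass

# Hypothetical set of quality flags, chosen only for illustration.
ALLOWED_FLAGS = {"measured", "predicted", "curated", "unverified"}


@dataclass(frozen=True)
class PropertyRecord:
    """One property value together with the metadata needed to trust it."""
    compound_name: str
    property_name: str
    value: float
    units: str
    source: str                      # attribution: where the value came from
    quality_flag: str = "unverified"
    annotations: tuple = ()

    def problems(self):
        """Return a list of representation problems found in this record."""
        issues = []
        if not self.units.strip():
            issues.append("missing units")
        if not self.source.strip():
            issues.append("missing attribution")
        if self.quality_flag not in ALLOWED_FLAGS:
            issues.append("unknown quality flag: " + self.quality_flag)
        return issues


# Placeholder record: the value 1.0 is NOT a real measurement.
record = PropertyRecord(
    compound_name="example compound",
    property_name="aqueous solubility",
    value=1.0,
    units="",
    source="hypothetical source",
)
print(record.problems())
```

A check like `problems()` is cheap to run at deposition time, before a value with no units or no attribution is copied onward into other databases.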

Nothing but the Facts

Jean-Claude Bradley, Drexel University

“There are no facts, only measurements embedded within assumptions”

Open Notebook Science

UsefulChem Blog: http://tinyurl.com/48dyujh

Aqueous Solubility of EGCG

Epigallocatechin gallate solubility in water

Melting Point of DMT

Content is King and Quality Costs

Chemistry “content” is big money:
Patent searching
Structures and properties
Drug databases
Literature databases

Chemical Abstracts Service (CAS), the “Gold Standard” in chemistry-related information:
101 years of content
$260 million revenue (2006)
>50 million substances
>60 million sequences

Where can we find data online?

Where is chemistry online?
Encyclopedic articles (Wikipedia)
Chemical vendor databases
Metabolic pathway databases
Property databases
Patents with chemical structures
Drug Discovery data
Scientific publications
Compound aggregators
Blogs/Wikis and Open Notebook Science

Lots of “Public Compound” Databases

PubChem
DrugBank
ChEBI/ChEMBL
KEGG
LIPID MAPS
ChemIDPlus
eMolecules
ZINC
Lots of chemical vendors
ChemSpider

Toxicology Data

Chemistry on the Internet

ChemSpider “links” chemistry on the internet:
Almost 25 million compounds, 400 data sources
Allows community deposition, curation, annotation
Integrating properties, publications, patents, media
Text, structure and substructure searching

www.chemspider.com

Search for a Chemical

Available Information…

Linked to vendors, safety data, toxicity, metabolism

We Have Delivered the Vision

“Build a Structure Centric Community to Serve Chemists”

Integrate chemical structure data on the web
Create a “structure-based hub” to information, data and algorithmic predictions
Let chemists contribute their own data
Allow the community to curate/correct data

Dialects describing chemicals

What is the Structure of Vitamin K?

MeSH

A lipid cofactor that is required for normal blood clotting. Several forms of vitamin K have been identified: VITAMIN K 1 (phytomenadione) derived from plants, VITAMIN K 2 (menaquinone) from bacteria, and synthetic naphthoquinone provitamins, VITAMIN K 3 (menadione). Vitamin K 3 provitamins, after being alkylated in vivo, exhibit the antifibrinolytic activity of vitamin K. Green leafy vegetables, liver, cheese, butter, and egg yolk are good sources of vitamin K

What is the Structure of Vitamin K1?

CAS’s Common Chemistry

Wikipedia

ChEBI – Manual Curation

PubChem

“2-methyl-3-(3,7,11,15-tetramethylhexadec-2-enyl)naphthalene-1,4-dione”

Variants of systematic names on PubChem

2-methyl-3-[(E,7R,11R)-3,7,11,15-tetramethyl 2-methyl-3-[(E,7S,11R)-3,7,11,15-tetramethyl 2-methyl-3-[(E,7R,11S)-3,7,11,15-tetramethyl 2-methyl-3-[(E,7S,11S)-3,7,11,15-tetramethyl 2-methyl-3-[(E,11S)-3,7,11,15-tetramethyl 2-methyl-3-[(E)-3,7,11,15-tetramethyl 2-methyl-3-(3,7,11,15-tetramethyl 2-methyl-3-[(E)-3,7,11,15-tetramethyl
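The name variants above differ only in their stereodescriptors. As a rough illustration of that point, the sketch below strips the parenthesised stereo prefix from each name fragment (kept truncated, exactly as on the slide) and groups what remains. This is plain string handling written for this example and the function names are hypothetical; a real comparison would work on structure representations with a cheminformatics toolkit rather than on names.

```python
import re
from collections import defaultdict

# Truncated name fragments, copied as they appear on the slide.
NAMES = [
    "2-methyl-3-[(E,7R,11R)-3,7,11,15-tetramethyl",
    "2-methyl-3-[(E,7S,11R)-3,7,11,15-tetramethyl",
    "2-methyl-3-[(E,7R,11S)-3,7,11,15-tetramethyl",
    "2-methyl-3-[(E,7S,11S)-3,7,11,15-tetramethyl",
    "2-methyl-3-[(E,11S)-3,7,11,15-tetramethyl",
    "2-methyl-3-[(E)-3,7,11,15-tetramethyl",
    "2-methyl-3-(3,7,11,15-tetramethyl",
    "2-methyl-3-[(E)-3,7,11,15-tetramethyl",
]

# Matches a stereodescriptor prefix such as "(E,7R,11S)-" or "(E)-".
STEREO = re.compile(r"\((?:\d*[ERSZ])(?:,\d*[ERSZ])*\)-")


def split_stereo(name):
    """Return (name without stereo prefix, stereo label or '')."""
    match = STEREO.search(name)
    label = match.group(0)[1:-2] if match else ""
    bare = STEREO.sub("", name).replace("[", "(").replace("]", ")")
    return bare, label


def group_by_skeleton(names):
    """Map each stereo-free name to the set of stereo labels seen for it."""
    groups = defaultdict(set)
    for name in names:
        bare, label = split_stereo(name)
        groups[bare].add(label or "unspecified")
    return dict(groups)


groups = group_by_skeleton(NAMES)
for bare, labels in groups.items():
    print(bare, "->", sorted(labels))
```

All eight fragments collapse to a single stereo-free name carrying seven distinct stereo labels, which is the ambiguity a curator has to resolve.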

Public Domain Chemistry Databases

Our databases are a mess…

Non-curated databases are proliferating errors

We source and deposit data between databases

Original sources of errors hard to determine

Curation is time-consuming, challenging and exacting

Vancomycin

Who will curate?

PubChem is not resourced to clean these errors

How would you clean such a large dataset?

The FDA’s DailyMed

Structures on DailyMed

Lack of Stereochemistry

Incorrect Structures

Wow!

We want to model DILI…

Drug metabolism in the liver can convert some drugs into highly reactive intermediates, which can affect the structure and function of the liver.

Drug-induced liver injury (DILI) is the #1 reason drugs are not approved or are withdrawn from the market after approval

Estimated global annual incidence rate of DILI is 13.9-24.0 per 100,000 inhabitants

DILI accounts for an estimated 3-9% of all adverse drug reactions reported to health authorities

Herbal components can cause DILI too

Thanks to Sean Ekins

https://dilin.dcri.duke.edu/for-researchers/info/

Initial DILI data – Names and Data

Griseofulvin
Hycanthone
Hydrochlorothiazide
Hydrocortisone
Hydroxyurea
Idarubicin HCl
Idoxuridine
Imipramine HCl
indomethacin
isoniazid
Isoproterenol HCl
Isotretinoin
Isoxsuprine HCl
Kanamycin Sulfate
Ketorolac Tromethamine
Ketotifen
Labetalol

So you want data on drugs???

Sourcing data based on drug names is difficult!

Where would you find the “correct chemical structures”?

What databases can you trust?

Vytorin: Ezetimibe/Simvastatin

Symbicort: Budesonide + Formoterol

ChemIDPlus

Wikipedia

DrugBank: Search Symbicort…

PubChem

8 structures called Budesonide, 1 “correct”
6 structures called Formoterol, 1 “correct”
Search on “Symbicort” gives 1 structure

Taxol: Paclitaxel 44 structures

Taxol: Paclitaxel Bioassay Data

Public Domain Chemistry Databases

An examination of quality in databases – inter/intra lab comparison of processes for 150 drugs

Columns: Drug Name, Generic Name, ChEBI, ChemSpider, CAS Common Chemistry, ChemIDPlus, DailyMed, DrugBank, PubChem, Wikipedia

Spiriva, Tiotropium Bromide: No Hits, No Hits, 4/0
Depakote, Valproate semisodium: No Structure
Basen, Voglibose: No Hits, No Hits, 2/1
Symbicort 1), Budesonide: 8/1
Symbicort 2), Formoterol: WRONG, No Hits, 6/1
Vytorin 1), Ezetimibe: No Hits
Vytorin 2), Simvastatin: 2/1
Taxol, Paclitaxel: 44/1
Thalidomid, Thalidomide: No Hits
Zocor, Simvastatin: 2/1
Crestor, Rosuvastatin: No Hits, 2/1
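The count cells in the comparison above can be turned into a simple fraction. The sketch below assumes that a cell such as "8/1" means "8 structures returned, 1 of them correct"; that reading is inferred from the PubChem slide (8 structures called Budesonide, 1 “correct”) and is not stated explicitly in the table. The function names are hypothetical.

```python
def parse_counts(cell):
    """Parse a 'hits/correct' cell such as '8/1' into a pair of integers.

    Returns None for non-numeric cells such as 'No Hits' or 'WRONG'.
    """
    parts = cell.split("/")
    if len(parts) == 2 and all(p.strip().isdigit() for p in parts):
        return int(parts[0]), int(parts[1])
    return None


def fraction_correct(cell):
    """Fraction of returned structures that are correct, or None if no counts."""
    counts = parse_counts(cell)
    if counts is None:
        return None
    hits, correct = counts
    return correct / hits if hits else 0.0


# Count cells taken from the comparison above.
CELLS = {
    "Budesonide": "8/1",
    "Formoterol": "6/1",
    "Paclitaxel": "44/1",
    "Tiotropium Bromide": "4/0",
}

for drug, cell in CELLS.items():
    frac = fraction_correct(cell)
    print(f"{drug}: {cell} -> {frac:.3f} of returned structures correct")
```

Under that reading, a name search for Paclitaxel returns the right structure in about 1 of 44 hits, so a model built by picking a hit at random would almost always train on the wrong structure.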

Personal Experiences

Highest Quality Resources: DSSTox (EPA), ChEBI (EBI)

High Quality Resources: DrugBank, Human Metabolome Database, ChemIDPlus, ChemSpider, KEGG

Are there others you use???

What can be done to help?

“Crowdsourcing” – gather the support of members of the community to add, annotate and curate data

Wikipedia is the domain success story for crowdsourcing.
PubChem is an example of “crowdsourced deposition” of chemistry data.
ChemSpider is an example of “crowdsourced deposition and curation”.

Open source software: descriptors and algorithms
QSAR should be cheaper and better!
Selectively share your models with collaborators
Centralized hosting of models/predictions

The Future: Open Source and Data

The Future: Open PHACTS

The Open PHACTS project will develop an open access innovation platform, called Open Pharmacological Space (OPS), via a semantic web approach. OPS will be comprised of data, vocabularies and infrastructure needed to accelerate drug-oriented research.

Exposing Data for Semantic Web…

Coming soon…

Book chapter:

“Accessing, Using and Creating Chemical Property Databases For Computational Toxicology Modeling”

Antony J. Williams, Sean Ekins, Ola Spjuth and Egon L. Willighagen

Thank you

Email: williamsa@rsc.org
Twitter: ChemConnector
Blog: www.chemspider.com/blog
Personal Blog: www.chemconnector.com
SLIDES: www.slideshare.net/AntonyWilliams