Date posted: 25-Jun-2015
Uploaded by: antony-williams-chemconnector-orcid-0000-0002-2668-4821
ChemSpider - Connecting and Curating Online Chemistry Resources
Antony Williams
EBI, November 30th 2010
Chemistry on the Internet
100s of websites serving up chemistry data, SDF files of structures and data
Some primary resources: PubChem, ChEBI, DrugBank, ChemIDPlus, Wikipedia
ChemSpider “links” chemistry on the internet
Almost 25 million compounds, 400 data sources
Allows community deposition, curation, annotation
Integrating properties, publications, patents, media
Text, structure, substructure (in testing) searching
www.chemspider.com
Search for a Chemical
Available Information…
Linked to vendors, safety data, toxicity, metabolism
We Have Delivered the Vision
“Build a Structure Centric Community to Serve Chemists”
Integrate chemical structure data on the web
Create a “structure-based hub” to information, data and algorithmic predictions
Let chemists contribute their own data
Allow the community to curate/correct data
How Did We Build It?
We deal in Molfiles or SDF files – including coordinates
We do rudimentary filtering – valence checking, charge imbalance – prior to deposition
We have our own “business logic” to standardize
We use InChI to “aggregate tautomers” to one record
Link out to external sites where possible using IDs
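The filter-then-aggregate idea on this slide can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not ChemSpider's actual pipeline: the record layout, the source names and IDs, and the helpers passes_basic_checks and aggregate_by_inchi are all hypothetical, and the InChI strings are assumed to have been computed upstream by an InChI toolkit. Because the standard InChI normalizes many (though not all) tautomeric forms, records keyed on it tend to collapse tautomers into one entry.

```python
from collections import defaultdict

# Hypothetical deposited records. The InChI strings are assumed to come from
# an upstream InChI toolkit; source names and IDs are made up for illustration.
RECORDS = [
    {"source": "SourceA", "id": "A-001",
     "inchi": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3", "net_charge": 0},
    {"source": "SourceB", "id": "B-417",
     "inchi": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3", "net_charge": 0},
    {"source": "SourceC", "id": "C-093",
     "inchi": "InChI=1S/H2O/h1H2", "net_charge": 0},
    # A lone counter-ion deposited without its partner: charge-imbalanced.
    {"source": "SourceC", "id": "C-094",
     "inchi": "InChI=1S/ClH/h1H/p-1", "net_charge": -1},
]


def passes_basic_checks(record):
    """Rudimentary pre-deposition filter (assumption: reject non-zero net charge)."""
    return record["net_charge"] == 0


def aggregate_by_inchi(records):
    """Group accepted records that share an InChI into one aggregated entry."""
    aggregated = defaultdict(list)
    for record in records:
        if passes_basic_checks(record):
            # Keep the source and its ID so the entry can link back out.
            aggregated[record["inchi"]].append((record["source"], record["id"]))
    return dict(aggregated)


if __name__ == "__main__":
    for inchi, links in aggregate_by_inchi(RECORDS).items():
        print(inchi, "->", links)
```

Keeping the (source, ID) pairs on each aggregated entry is what makes the link-out step possible later; the valence check mentioned on the slide is left out because it needs a real structure toolkit.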
Inherited Errors
We have inherited errors from every database… all public compound databases, including ours, have errors
“Incorrect” structures – assertions, timelines etc
“Incorrect” names associated with structures
Properties
Links
Publications
ENORMOUS CHALLENGE
What is the Structure of Vitamin K?
MeSH
A lipid cofactor that is required for normal blood clotting. Several forms of vitamin K have been identified: VITAMIN K 1 (phytomenadione) derived from plants, VITAMIN K 2 (menaquinone) from bacteria, and synthetic naphthoquinone provitamins, VITAMIN K 3 (menadione). Vitamin K 3 provitamins, after being alkylated in vivo, exhibit the antifibrinolytic activity of vitamin K. Green leafy vegetables, liver, cheese, butter, and egg yolk are good sources of vitamin K
What is the Structure of Vitamin K1?
CAS’s Common Chemistry
Wikipedia
ChEBI – Manual Curation
PubChem
“2-methyl-3-(3,7,11,15-tetramethylhexadec-2-enyl)naphthalene-1,4-dione”
Variants of systematic names on PubChem
2-methyl-3-[(E,7R,11R)-3,7,11,15-tetramethyl 2-methyl-3-[(E,7S,11R)-3,7,11,15-tetramethyl 2-methyl-3-[(E,7R,11S)-3,7,11,15-tetramethyl 2-methyl-3-[(E,7S,11S)-3,7,11,15-tetramethyl 2-methyl-3-[(E,11S)-3,7,11,15-tetramethyl 2-methyl-3-[(E)-3,7,11,15-tetramethyl 2-methyl-3-(3,7,11,15-tetramethyl 2-methyl-3-[(E)-3,7,11,15-tetramethyl
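The name variants listed above differ only in their leading stereodescriptor. A small sketch makes that explicit by separating the descriptor from the rest of each (truncated) name. The regular expression and the helper split_stereo are illustrative assumptions written for these particular strings, not a general systematic-name parser.

```python
import re

# Name prefixes as listed on the slide (each is truncated in the source).
NAME_PREFIXES = [
    "2-methyl-3-[(E,7R,11R)-3,7,11,15-tetramethyl",
    "2-methyl-3-[(E,7S,11R)-3,7,11,15-tetramethyl",
    "2-methyl-3-[(E,7R,11S)-3,7,11,15-tetramethyl",
    "2-methyl-3-[(E,7S,11S)-3,7,11,15-tetramethyl",
    "2-methyl-3-[(E,11S)-3,7,11,15-tetramethyl",
    "2-methyl-3-[(E)-3,7,11,15-tetramethyl",
    "2-methyl-3-(3,7,11,15-tetramethyl",
    "2-methyl-3-[(E)-3,7,11,15-tetramethyl",
]

# Matches a parenthesised stereodescriptor followed by a hyphen, e.g. "(E,7R,11R)-".
STEREO_RE = re.compile(r"\((?P<desc>[EZRS0-9,]+)\)-")


def split_stereo(name):
    """Return (descriptor or None, name with the descriptor removed)."""
    match = STEREO_RE.search(name)
    descriptor = match.group("desc") if match else None
    stripped = STEREO_RE.sub("", name)
    # Normalise "[" to "(" so bracket style does not hide identical skeletons.
    stripped = stripped.replace("[", "(")
    return descriptor, stripped


if __name__ == "__main__":
    pairs = [split_stereo(name) for name in NAME_PREFIXES]
    for descriptor, stripped in pairs:
        print(f"{descriptor!s:>10}  {stripped}")
    print("distinct names after stripping:", len({s for _, s in pairs}))
```

After stripping, all eight prefixes reduce to the same string, which is the sense in which they are variants of one name.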
Public Domain Chemistry Databases
Our databases are a mess…
Non-curated databases are proliferating errors
We source and deposit data between databases
Original sources of errors hard to determine
Curation is time-consuming, challenging and exacting
An examination of quality in databases – inter/intra lab comparison of processes for 150 drugs
Vytorin: Ezetimibe/Simvastatin
Symbicort: Budesonide + Formoterol
ChemIDPlus
Wikipedia
DrugBank: Search Symbicort…
Symbicort: Budesonide + Formoterol
PubChem
8 structures called Budesonide. 1 “correct”
6 structures called Formoterol. 1 “correct”
Search on “Symbicort” gives 1 structure.
Taxol: Paclitaxel
44 structures
Taxol: Paclitaxel Bioassay Data
Most bioassay data are associated with a structure that has one ambiguous stereocenter
Data on the Web – Good or Bad??
Taken from: Rafael Sidi’s Blog
Data on the Registry
How are data handled in Pharma?
Algorithms for “collapsing” data?
Skeletons only?
Processing structure-name pairs?
Manual curation?
Does it matter relative to the noise in the measurements?
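One way to picture the “skeletons only” option is to collapse records on an InChI with its stereo layers removed, so that stereoisomers and stereo-unspecified structures fall into one group. The sketch below is an assumption about how such a collapse could work, not a description of any company's registration system. The helpers strip_stereo_layers and collapse are hypothetical, and the alanine InChI strings are included only as illustrative input.

```python
from collections import defaultdict

# InChI layers that carry stereochemistry: double-bond (/b), tetrahedral (/t),
# and the /m and /s flags that qualify the tetrahedral layer.
STEREO_LAYER_PREFIXES = ("b", "t", "m", "s")


def strip_stereo_layers(inchi):
    """Return the InChI with its /b, /t, /m and /s layers dropped."""
    parts = inchi.split("/")
    # The version prefix and the formula are always kept as they are.
    head, layers = parts[:2], parts[2:]
    kept = [layer for layer in layers
            if not layer.startswith(STEREO_LAYER_PREFIXES)]
    return "/".join(head + kept)


def collapse(records):
    """Group (label, inchi) pairs by their stereo-free InChI."""
    groups = defaultdict(list)
    for label, inchi in records:
        groups[strip_stereo_layers(inchi)].append(label)
    return dict(groups)


# Illustrative input: two enantiomers and a stereo-unspecified record.
RECORDS = [
    ("L-alanine",
     "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"),
    ("D-alanine",
     "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m1/s1"),
    ("alanine, stereo unspecified",
     "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)"),
]

if __name__ == "__main__":
    for key, labels in collapse(RECORDS).items():
        print(key, "->", labels)
```

The trade-off is the one the slide asks about: collapsing this way pools assay data that may belong to different stereoisomers, which matters only if the difference exceeds the noise in the measurements. Any stereo sublayers inside an isotopic layer are dropped as well, which is acceptable for a skeleton-level key.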
Do correct structure representations matter, and to whom?
NLM’s DailyMed
Consider searching each of these chemical databases by chemical name (systematic name, trade name or synonym). Please mark each online resource according to how much you generally trust the results.
Drug Name / Generic Name / ChEBI / ChemSpider / CAS Com. Chem / ChemIDPlus / DailyMed / DrugBank / PubChem / Wikipedia
Spiriva / Tiotropium Bromide: No Hits, No Hits, 4/0
Depakote / Valproate semisodium: No Structure
Basen / Voglibose: No Hits, No Hits, 2/1
Symbicort 1) Budesonide: 8/1
Symbicort 2) Formoterol: WRONG, No Hits, 6/1
Vytorin 1) Ezetimibe: No Hits
Vytorin 2) Simvastatin: 2/1
Taxol / Paclitaxel: 44/1
Thalomid / Thalidomide: No Hits
Zocor / Simvastatin: 2/1
Crestor / Rosuvastatin: No Hits, 2/1
Why Curated Dictionaries Matter
Success Depends on Dictionaries
Online Curation
Online databases generally do NOT allow curation or annotation
If you find errors they stay there!
ChemSpider allows immediate curation
Crowdsourcing Works
Over 100 people have deposited data (structures, spectra, etc.) and participated in data curation
Curators at different levels check each other's work
Wikipedia is the modern primary example
Some curators are “madmen”…
The Oxford English Dictionary
Collaborative Data Curation
How can we COLLECTIVELY clean online data?
ChemSpider has inherited junk from >400 data sources. Some of this has proliferated into PubChem. We should deprecate it.
We need to develop a way to share curation actions back to original data sources
A mindset of bigger is better is problematic. How many “real chemicals” are in the public databases?
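As a thought experiment on sharing curation actions back to the original data sources, the sketch below shows one possible shape for a machine-readable curation record. Everything here is hypothetical: the CurationAction fields, the helper names, and the example values are assumptions made for illustration, and no existing ChemSpider or PubChem format is implied.

```python
import json
from collections import defaultdict
from dataclasses import asdict, dataclass


@dataclass
class CurationAction:
    """Hypothetical record of one curation decision."""
    source: str       # database the record originally came from
    source_id: str    # identifier of the record in that database
    field: str        # what was curated, e.g. "name" or "structure"
    old_value: str    # value as deposited
    new_value: str    # value after curation ("" means removed)
    curator: str      # who made the change
    comment: str = ""


def to_message(action):
    """Serialise an action as JSON so it could be passed back to its source."""
    return json.dumps(asdict(action), sort_keys=True)


def group_by_source(actions):
    """Bundle serialised actions per originating database."""
    bundles = defaultdict(list)
    for action in actions:
        bundles[action.source].append(to_message(action))
    return dict(bundles)


# Made-up example values.
ACTIONS = [
    CurationAction("ExampleDB", "X-123", "name", "Vitamin K", "",
                   "curator_1", "name too generic for this structure"),
    CurationAction("OtherDB", "Y-9", "structure", "stereo undefined",
                   "stereo defined", "curator_2"),
]

if __name__ == "__main__":
    for source, messages in group_by_source(ACTIONS).items():
        print(source, messages)
```

Recording the old value alongside the new one lets the receiving database check that the correction still applies to its current record before accepting it.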
ChemSpider
ChemSpider is free to use. Multiple web services are available. New data added daily. Curation and data validation ongoing every day. Provided by the RSC.
www.chemspider.com
Thank you
Email: [email protected]
Twitter: ChemConnector
Blog: www.chemspider.com/blog
Personal Blog: www.chemconnector.com
SLIDES: www.slideshare.net/AntonyWilliams