ChemSpider as a Platform for Crowd Participation in Curating Chemistry

Post on 10-May-2015

description

This is a presentation I gave at the International Digital Curation Conference in Chicago, December 7th 2010, #idcc10. The presentation discusses the issues of data quality and the need for collective, crowdsourced efforts to improve the quality of chemistry-related data on the Internet.

transcript

ChemSpider as a Platform for Crowd Participation in Curating Chemistry

Antony Williams, IDCC, Chicago, December 2010

WARNING: Chemistry is Dangerous

Di-Hydrogen Monoxide

2H + 1O

H2O

Water

It’s all on Wikipedia…

Chemistry on the Internet – Not All Bad

100s of websites hosting chemistry-related data
Chemistry information is generally “compound-based”:
Chemical “structures”
Identifiers, names and synonyms
Properties
Analytical data
How to synthesize
Articles, patents, safety information

Chemistry “language and dialects”

Dialects describing chemicals

A Pragmatic Vision

“Build a Structure Centric Community”

Integrate chemistry across the internet based on “chemical structure”

A “structure-based hub” to information and data
Let chemists contribute their own data
Allow the community to curate & annotate data
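The "structure-based hub" idea can be illustrated with a minimal sketch: many names and many external links all resolve to one record keyed by a structure identifier. The class and method names below (`StructureHub`, `add_name`, `add_link`, `lookup`) are hypothetical and are not ChemSpider's API or data model; the use of an InChIKey as the structure key is also an assumption made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class StructureRecord:
    """One hub entry: a single chemical structure and everything linked to it."""
    structure_key: str                                     # assumed here to be an InChIKey
    names: Set[str] = field(default_factory=set)           # synonyms, trade names, ...
    links: Dict[str, str] = field(default_factory=dict)    # data source -> URL


class StructureHub:
    """Hypothetical in-memory hub: names and links all point at one structure."""

    def __init__(self) -> None:
        self._records: Dict[str, StructureRecord] = {}
        self._name_index: Dict[str, str] = {}   # lowercased name -> structure key

    def _record(self, structure_key: str) -> StructureRecord:
        # Create the record on first use so callers never handle missing keys.
        return self._records.setdefault(structure_key, StructureRecord(structure_key))

    def add_name(self, structure_key: str, name: str) -> None:
        self._record(structure_key).names.add(name)
        self._name_index[name.lower()] = structure_key

    def add_link(self, structure_key: str, source: str, url: str) -> None:
        self._record(structure_key).links[source] = url

    def lookup(self, name: str) -> Optional[StructureRecord]:
        # Name search is case-insensitive; unknown names return None.
        key = self._name_index.get(name.lower())
        return self._records.get(key) if key is not None else None


WATER_KEY = "XLYOFNOQVPJJNP-UHFFFAOYSA-N"  # standard InChIKey for water

hub = StructureHub()
for synonym in ("Water", "Dihydrogen monoxide", "H2O"):
    hub.add_name(WATER_KEY, synonym)
hub.add_link(WATER_KEY, "Wikipedia", "https://en.wikipedia.org/wiki/Water")

record = hub.lookup("dihydrogen monoxide")
print(record.structure_key, sorted(record.names))
```

Whatever "dialect" a chemist searches in, the lookup lands on the same structure record, which is where contributed data and curation comments would attach.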

www.chemspider.com

Answering Questions for Chemists

Questions a chemist might ask…

What is the melting point of n-heptanol?
What is the chemical structure of Xanax?
Chemically, what is phenolphthalein?
What are the stereocenters of cholesterol?
Where can I find publications about xylene?
What are the different trade names for Aspirin?
What is the NMR spectrum of Benzoic Acid?
What are the safety handling issues for toluene?

Search for a Chemical…by name

Available Information…

Linked to chemical vendors, safety data, toxicity, metabolism…

ChemSpider Today

Almost 25 million unique chemicals
Over 400 data sources
Grows daily – community and RSC depositions
Community annotation and curation

We curate, edit, change, enhance data daily

Three Years of Experience

Internet-based chemistry is a mess!

Public compound databases are contaminated

The annotation/curation of data online is difficult

Most database hosts are non-responsive to feedback – “We are a host/repository of data”

Who cares?

Linked Data on the Web

Where is chemistry online?
Encyclopedic articles (Wikipedia)
Chemical vendor databases
Metabolic pathway databases
Property databases
Patents with chemical structures
Drug Discovery data
Scientific publications
Compound aggregators
Blogs/Wikis and Open Notebook Science

What is the Structure of Vitamin K?

MeSH – Medical Subject Headings

Several forms of vitamin K have been identified: VITAMIN K 1 (phytomenadione) derived from plants, VITAMIN K 2 (menaquinone) from bacteria, and synthetic naphthoquinone provitamins, VITAMIN K 3 (menadione).

What is the Structure of Vitamin K1?

Chemical Abstracts “Common Chemistry” Database

Wikipedia

Incorrect Structures

Lack of Stereochemistry

Does stereochemistry matter?

Distaval, Talimol, Nibrol, Sedimide, Quietoplex, Contergan, Neurosedyn, Softenon, Thalidomide
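Why missing stereochemistry is a curation problem can be shown with a toy sketch: two mirror-image molecules have distinct records only while the stereo markers are present. The example uses the two enantiomers of alanine written as SMILES, which is an illustration chosen here and not taken from the talk. The `strip_stereo` helper is hypothetical and deliberately naive; it only handles the bracketed `[C@H]` / `[C@@H]` pattern and is no substitute for a real cheminformatics toolkit, since SMILES strings are not in general comparable as plain text.

```python
import re

# Matches a bracketed tetrahedral carbon written as [C@H] or [C@@H].
_STEREO_CARBON = re.compile(r"\[C@{1,2}H\]")


def strip_stereo(smiles: str) -> str:
    """Drop tetrahedral stereo markers (toy version: [C@H]/[C@@H] only)."""
    return _STEREO_CARBON.sub("C", smiles)


# The two enantiomers of alanine differ only in the @ / @@ marker.
enantiomer_a = "C[C@H](N)C(=O)O"
enantiomer_b = "C[C@@H](N)C(=O)O"

print(enantiomer_a == enantiomer_b)                              # distinct records
print(strip_stereo(enantiomer_a) == strip_stereo(enantiomer_b))  # collapsed into one
```

Once the markers are lost, a database can no longer tell the two forms apart, so any property or safety data attached to one enantiomer silently applies to both.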

PubChem

What’s Methane?

What ELSE is Methane???

Internet-Based Chemistry is a Mess

Algorithms can get you so far

Human curation is necessary

Only the crowds can help with big data…

ChemSpider is approaching 25 million compounds

Search “Vitamin H”

“Curate” Identifiers

Crowd-sourcing Chemistry Curation

Crowd-sourced curation: identify/tag errors, edit names, synonyms, identify records to deprecate

“Curate” Identifiers

General curation activities
Remove incorrect names
Correct spellings
Add multilingual names
Add alternative names

In 3 years over 1 million structure-identifier relationships have been validated – robotically and manually

130 people have participated in validation or annotation. “Crowds” can be quite small!
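The validation of structure-identifier relationships described above could be modelled roughly as follows: each claim that a name belongs to a structure collects verdicts from robots and human curators, and its status follows from those verdicts. This is a minimal sketch under stated assumptions. The names `NameAssertion`, `Verdict`, the curator labels, and the simple majority rule are all hypothetical and are not how ChemSpider is known to implement curation.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict


class Verdict(Enum):
    CONFIRM = "confirm"
    REJECT = "reject"


@dataclass
class NameAssertion:
    """A claim that `name` refers to the structure identified by `structure_key`."""
    structure_key: str
    name: str
    votes: Dict[str, Verdict] = field(default_factory=dict)  # curator -> verdict

    def vote(self, curator: str, verdict: Verdict) -> None:
        # A curator's latest vote replaces any earlier one.
        self.votes[curator] = verdict

    def status(self) -> str:
        # Assumed rule: simple majority of the recorded verdicts.
        if not self.votes:
            return "unreviewed"
        confirms = sum(v is Verdict.CONFIRM for v in self.votes.values())
        rejects = len(self.votes) - confirms
        if confirms > rejects:
            return "validated"
        if rejects > confirms:
            return "rejected"
        return "disputed"


WATER_KEY = "XLYOFNOQVPJJNP-UHFFFAOYSA-N"  # standard InChIKey for water

good = NameAssertion(WATER_KEY, "Water")
good.vote("name-check-robot", Verdict.CONFIRM)   # hypothetical automated check
good.vote("curator_1", Verdict.CONFIRM)          # hypothetical human curator

bad = NameAssertion(WATER_KEY, "Methane")
bad.vote("curator_1", Verdict.REJECT)

print(good.status(), bad.status())
```

Keeping the individual votes, not just the outcome, is what would let different levels of curator check each other's work and let results be shared with other databases.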

Crowdsourcing Works

The “crowd” has deposited data (structures, spectra, etc.) and participated in data curation

Curators at different levels check each other’s work
Wikipedia is the primary modern example
Some curators are “madmen”… The Oxford English Dictionary

Vancomycin – Curate This!!!

Vancomycin on ChemSpider

1 compound – 3 days

Crowdsourced “Annotations”

Users can add:
Descriptions/Syntheses/Commentaries
Links to articles
Spectral data
Photos
MP3 files
Videos

Multimedia Content Holder

Gaming for Curation of Spectra

ChemSpider Everywhere

Crowdsourced Curation of Spectra

Data Curation

True Curation of Data

ChemSpider SyntheticPages

Columns: Drug Name, Generic Name, ChEBI, ChemSpider, CAS Common Chemistry, ChemIDPlus, DailyMed, DrugBank, PubChem, Wikipedia

Rows (only the non-blank cells survive in the transcript; their column positions are not recoverable):
Spiriva, Tiotropium Bromide: No Hits, No Hits, 4/0
Depakote, Valproate semisodium: No Structure
Basen, Voglibose: No Hits, No Hits, 2/1
Symbicort, 1) Budesonide: 8/1
Symbicort, 2) Formoterol: WRONG, No Hits, 6/1
Vytorin, 1) Ezetimibe: No Hits
Vytorin, 2) Simvastatin: 2/1
Taxol, Paclitaxel: 44/1
Thalidomid, Thalidomide: No Hits
Zocor, Simvastatin: 2/1
Crestor, Rosuvastatin: No Hits, 2/1

Sharing Our Activities

Presently defining approaches with other public compound databases to share results of curation activities

Member of large European project to link data from the Life Sciences. Sharing results of curation is essential

Making curation and contribution interfaces Mobile

Mobile ChemSpider

First request to Database Hosts!

Every public compound database host should add ONE feature – “Leave Comments”

Second request to Database Hosts!

Show Comments

Question Quality

Thank you

Email: williamsa@rsc.org
Twitter: ChemConnector
Blog: www.chemspider.com/blog
Personal Blog: www.chemconnector.com
SLIDES: www.slideshare.net/AntonyWilliams