DATA-RICH CHEMISTRY INSIDE WIKIPEDIA & OTHER WIKIS
Martin A Walker, SUNY Potsdam
OVERVIEW
Chemical data in Wikipedia
Validation of Wikipedia chemical data
RSC Learn Chemistry
Conclusion
SUBSTANCE DATA IN WIKIPEDIA
Wikipedia is designed as an encyclopedia, NOT a database, BUT many cheminformatics groups want to use data from Wikipedia.
Since most data are entered by a human being rather than by machine, Wikipedia can often provide a data source that is independent of the main online databases.
Could the Wikipedia chemists make the data more accessible without compromising the project’s mission? What about DBpedia?
CHEMBOXES & DRUGBOXES
The Chembox on a substance page contains standard representations such as:
  Skeletal formula
  IUPAC name
  InChI and InChIKey
  CAS no. (represents substance, not structure)
  SMILES (proprietary but de facto standard before InChI)
These were traditionally supplied for readers to copy/paste, but we were asked to make a machine-friendly version.
WIKIPEDIA DRUG PAGES
EARLY CHEMBOXES
Chemboxes were originally set up as tables – OK for people, but not for data mining.
A typical chembox from 2007
NEW CHEMBOXES
Now designed as a set of data fields with values entered by the editor: better for data extraction and for validation
Drugboxes also redesigned
Machine-friendly formats (SMILES, InChI, InChIKey, CAS Reg. No.) included in nearly all chemboxes
Hide/show used to avoid table “explosions”
Collections of Wikipedia data are now available for cheminformatics groups to use
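To illustrate why field-based chemboxes are easier to mine than tables, here is a minimal Python sketch that pulls `| name = value` pairs out of template wikitext. The sample wikitext, the field names and the `extract_fields` helper are assumptions made for illustration; real chemboxes nest sub-templates and would need a proper wikitext parser.

```python
import re

# Hypothetical, simplified chembox wikitext (illustration only).
SAMPLE_WIKITEXT = """{{Chembox
| Name = Ethanol
| CASNo = 64-17-5
| SMILES = CCO
| StdInChI = 1S/C2H6O/c1-2-3/h3H,2H2,1H3
| StdInChIKey = LFQSCWFLJHTTHZ-UHFFFAOYSA-N
}}"""

# One flat "| name = value" pair per line; nested templates are not handled.
FIELD_PATTERN = re.compile(r"^\|\s*([^=|]+?)\s*=\s*(.*?)\s*$", re.MULTILINE)


def extract_fields(wikitext: str) -> dict[str, str]:
    """Return a {field name: value} mapping for flat template parameters."""
    return {name: value for name, value in FIELD_PATTERN.findall(wikitext)}


fields = extract_fields(SAMPLE_WIKITEXT)
for name, value in fields.items():
    print(f"{name}: {value}")
```

Because each value sits in a named field, a downstream tool can read an identifier directly instead of scraping a rendered table.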
CURRENT FORM OF CHEMBOX
SIMPLE FULL FORM
TABLE EXPLOSIONS!
Some data (e.g., InChIs for complex molecules) can be very long, and this was a hindrance to their use in Wikipedia
VALUE OF THE INCHI AND INCHIKEY
InChI can be used to define what structure is being represented when compiling a virtual database.
InChI can provide an unambiguous reference when validating structures on Wikipedia.
InChIKey is useful to help those using search engines.
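The role of the InChI as an unambiguous reference can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names and the sample collections are hypothetical, the key check tests only the shape of a standard InChIKey (14 letters, 10 letters, 1 letter), and plain string comparison stands in for whatever comparison a real validation workflow would use.

```python
import re

# Shape of a standard InChIKey: 14 letters, hyphen, 10 letters, hyphen, 1 letter.
INCHIKEY_PATTERN = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")


def looks_like_inchikey(text: str) -> bool:
    """Check only the shape of the key, not whether it matches a structure."""
    return INCHIKEY_PATTERN.fullmatch(text.strip()) is not None


def compare_records(wiki: dict[str, str], reference: dict[str, str]) -> dict[str, str]:
    """Label each Wikipedia entry in a {InChIKey: InChI} collection."""
    report = {}
    for key, inchi in wiki.items():
        if not looks_like_inchikey(key):
            report[key] = "malformed key"
        elif key not in reference:
            report[key] = "not in reference"
        elif reference[key] == inchi:
            report[key] = "match"
        else:
            report[key] = "mismatch"
    return report


# Hypothetical sample data: water, ethanol, and one deliberately broken key.
wiki_entries = {
    "XLYOFNOQVPJJNP-UHFFFAOYSA-N": "InChI=1S/H2O/h1H2",
    "LFQSCWFLJHTTHZ-UHFFFAOYSA-N": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3",
    "not-a-key": "InChI=1S/CH4/h1H4",
}
reference_entries = {
    "XLYOFNOQVPJJNP-UHFFFAOYSA-N": "InChI=1S/H2O/h1H2",
}

report = compare_records(wiki_entries, reference_entries)
for key, status in report.items():
    print(f"{key}: {status}")
```

Using the fixed-length InChIKey as the lookup key and the full InChI as the value mirrors the division of labour on the slide: the key is easy to search for, the InChI defines the structure.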
DATA PAGES
PROBLEM: Table creep: users ask for the table to include the Standard Free Energy of Hydroformylation in a Black Box
ANSWER: Put it on a sub-page, the supplementary data page (something unique to chemistry!). Click on a link from the bottom of the Chembox:
DATA VALIDATION
How I use the key terms:
Validation => “How can I be sure the data are correct?”
Curation => an ongoing process of fixing errors
CONTENT VALIDATION
In 2008 a data validation drive was initiated for basic chemical identifiers
Led to a collaboration with CAS, to ensure Wikipedia CAS registry nos. are correct
More than 3500 substances have now been validated against CAS Common Chemistry as having the correct name, structure & CAS RN
Other fields now being validated
Validated content indicated with a check mark
THE APPROACH TO VALIDATION
Every old version of an article is preserved (for all) for posterity and is identified by a revision ID (RevID), so a specific RevID can potentially serve as a permanent record of a validated version.
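A validated revision can be cited by a permanent link built from its RevID. The sketch below assumes the usual English Wikipedia `index.php?title=...&oldid=...` URL form; the `permalink` helper and the revision number are hypothetical and used only for illustration.

```python
from urllib.parse import urlencode

# Assumed base URL for English Wikipedia's index.php entry point.
BASE_URL = "https://en.wikipedia.org/w/index.php"


def permalink(title: str, rev_id: int) -> str:
    """Build a link to one specific stored revision of an article."""
    # Wiki page titles use underscores in place of spaces.
    query = urlencode({"title": title.replace(" ", "_"), "oldid": rev_id})
    return f"{BASE_URL}?{query}"


# 123456 is a made-up RevID, not a real validated revision.
print(permalink("Acetic acid", 123456))
```

Storing the RevID alongside each validated entry means the checked version stays retrievable even after the live article has been edited.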
PROTECTING VALIDATED FIELDS
PROBLEM: This is “the encyclopedia anyone can edit”, so anyone can change the BP of water to 200 °C.
SOLUTION: A bot patrols the pages and watches for edits to key fields. Any dubious edits are flagged with a red X (next to the data) and logged.
System developed by Dirk Beetstra (Eindhoven University of Technology). It is the only such tool on Wikipedia.
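The idea of watching key fields can be sketched as a comparison between stored validated values and the values found in the current revision. This is not the actual bot's logic, which is not described here; the `flag_changes` function, the field names and the values are assumptions for illustration.

```python
def flag_changes(validated: dict[str, str], current: dict[str, str]) -> list[str]:
    """Return the names of validated fields whose current value differs."""
    return [
        field
        for field, expected in validated.items()
        if current.get(field) != expected
    ]


# Hypothetical field names and values for the water example above.
validated_fields = {"CASNo": "7732-18-5", "BoilingPtC": "100"}
edited_fields = {"CASNo": "7732-18-5", "BoilingPtC": "200"}

flagged = flag_changes(validated_fields, edited_fields)
for field in flagged:
    # A real system would mark the field on the page and write a log entry.
    print(f"[X] {field}: expected {validated_fields[field]!r}, "
          f"found {edited_fields[field]!r}")
```

A missing field is treated the same as a changed one, since `current.get(field)` returns `None` when the field has been deleted.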
VALIDATION PROTECTED BY BOT
If anyone tries to vandalize a validated field, this will be flagged by a bot soon afterwards.
This example received a red X 11 minutes after it was vandalized.
VALIDATED REVISIONIDS
CHECKING STRUCTURES
In 2008-2010, around 3000 chemical structures were informally checked against CAS Common Chemistry
PROBLEM: Structures are loaded from an external file on Wikimedia Commons, which can be “invisibly” changed
SINCE FALL 2010
The bot has been modified to watch changes to the RevID of the Wikimedia Commons structure image
A few hundred images validated so far
DRUGBOXES
Drugboxes are patrolled by the bot, but at present WP:PHARM is not active in formal validation. Most work is done by Dirk Beetstra, using official lists from data sources (e.g., ChEBI).
RSC LEARN CHEMISTRY
RSC LEARN CHEMISTRY WIKI
Aims to enrich RSC educational content with data from ChemSpider, then make it open for educators to contribute their own content (licensed under Creative Commons)
SUBSTANCE SEARCHES
SUBSTANCE PAGES: FOUND BY INCHI SEARCH
WITH LINKS TO SPECTRA:
QUIZZES: “PREDICT THE PRODUCT”
QUIZZES
INCHI PROVIDES THE WAY
CONCLUSION
Wikipedia can provide a useful “virtual database” of highly curated information on common chemicals and drugs.
Don’t forget the data page information!
The validation effort needs to go further: YOUR help is very welcome!
RSC Learn Chemistry shows that chemical data can also be used to enrich an educational site.
ACKNOWLEDGEMENTS
Congratulations to Henry and Peter, and thanks for the invitation to speak in their symposium.
Thanks to Antony Williams for his many contributions to both Wikipedia and Learn Chemistry.
Thanks to Aileen Day, Lorna Thomson, Duncan McMillan and RSC Education staff, and to RSC for the funding of Learn Chemistry.
Thanks to undergraduate student Tyson Terpstra for uploading many quiz InChIs.
Thank you for your attention!
ANY QUESTIONS?
COPYRIGHT INFORMATION
All of my own content in this presentation is released under a Creative Commons BY-SA-3.0 license.
Copyright information for images is usually attributed on the slide itself.
Content from Wikipedia and Learn Chemistry is reused via a Creative Commons BY-SA-3.0 license. For authors, please visit the original Wikipedia page and select the “history” tab.
Other pictures not attributed should only be my own personal pictures, also CC BY-SA-3.0.