Current opinions in drug discovery public compound databases

Page 1 of 37

Public Chemical Compound Databases

Antony J. Williams

Address: ChemZoo Inc., 904 Tamaras Circle, Wake Forest, NC-27587

Corresponding Author:[email protected]

PHONE: 919 341-8375

The internet has fast become the first port of call for all searches. The

increasing array of chemistry-related resources now available provides chemists a

direct path to the discovery of information, one previously accessed via library

services and limited to commercial and costly resources. The diversity of information

available online is expanding at a dramatic rate and a shift to publicly available

resources offers significant opportunities in terms of the benefit to science and

society. While the data available online do not generally meet the quality standards

available from manually curated sources there are efforts afoot to gather scientists

and “crowd source” an improvement in the quality of available data. This article will

discuss the types of public compound databases available online, provide a series of

example databases and focus on the benefits and disruptions associated with the

increased availability of such data and integrating technologies to data-mine the

available information.

Keywords Public databases, chemical structure databases, Open Data,

chemoinformatics, data mining, internet chemistry, Wikis, blogs

Introduction

The internet is likely used on a daily basis by the majority of scientists. There

is little doubt that the web is the primary portal to query for information and data

and, when coupled with the intranet services for most companies, is the tool of

choice for most general searches. For many years the search for scientific-related

information would start at the library and commonly engage skilled professionals in

the domain of searching. These people would have a deep understanding of

navigating the plethora of databases and resources, using their own query

languages, and would perform searches using for-fee resources. While such skills

remain of value most scientists conduct the majority of their own searches and

certainly utilize their access to a no-cost, intuitive and expansive internet of

information. There has been a tremendous growth in scientific internet resources and

there are enormous opportunities provided by such facile access to chemistry

information and data.

Bioinformatics certainly established the trend of providing online access to

data and Chemistry, in many ways, is far behind. Open-access databases such as

GenBank [1] (http://www.ncbi.nlm.nih.gov/Genbank/) and the Protein Data Bank (PDB)

[2] (http://www.wwpdb.org/) have been assisting biologists to translate gene and

protein sequences into biological relevance for over two decades. It is possible that

the difference in effort results from publishers in Chemistry discouraging the open

flow of data and information. This is true not only for scientific articles but also for

chemistry databases. With the changing expectations of society in terms of freedom

of access to information, and the efforts of many evangelists and groups, a shift

towards both free and open access (vide infra) chemistry-related information is well

underway and is likely to accelerate.

Murray-Rust envisages a world in which all scientific information is instantly

available [3•]. This emerging world of e-science or cyberscholarship seeks “to

develop the tools, content and social attitudes to support multidisciplinary,

collaborative science. Its immediate aims are to find ways of sharing information in a

form that is appropriate to all readers.” This article will discuss the work already

underway to support this noble and valid effort to provide enhanced public access to

Chemistry data and specifically focus on public chemical compound databases.

There are many tens of indexes of chemistry databases available online and

the reader is encouraged to perform one or more generic searches on “chemistry

databases” to retrieve a list of related information. The author’s preferred source of

information is the Wiki hosted by Gary Wiggins [4•]. While the availability of freely

accessible information is clearly of value to scientists there are risks in terms of the

quality of information available. It is this quality issue which provides the

mainstream publishers, for the time being, a foothold in the domain of providing

value-added access to scientific information. That said, public compound databases

especially have become a disruptive force for certain commercial bodies and the

threat has caused significant duress. The potential impact on the business models of

publishers and the increased capabilities and diversity of data within public

compound databases will also be highlighted.

Public Chemistry Databases

There are many freely available chemical compound databases on the web

and they assume many different forms. They can simply be a collection of chemical

structures aggregated into a single file and made available, gratis, for people to

download and utilize as they see fit. These files are generally available in the form of

an SDF file [5] and can be downloaded and then imported to a database for

searching and viewing. There are literally hundreds of such files available online and

they are commonly available from chemical vendors in order to advertise their

catalog collections. These files generally contain the chemical identifiers in the form

of chemical names (systematic and trade) and registry numbers. The files can also

contain experimental or physical properties, file specific identifiers and pricing

information. There are aggregators who gather such files of chemical structures and

related information, assemble them into a single database and serve it up to the

public (some examples will be discussed later). Since the files are assembled in a

heterogeneous manner the resulting data are plagued with inconsistencies and data

quality issues. Such an approach to gathering and merging data is a far cry from that

taken by commercial database vendors who manually gather and curate data. Some

examples of these commercial organizations are CAS [6], InfoChem [8] and Symyx

[9].
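
To make the description of vendor SDF files concrete, the following is a minimal sketch of how such a file might be read before loading it into a database. It assumes the common SDF layout (a connection table ending in "M  END", data items introduced by "> <FIELD>" headers, and records separated by "$$$$"). The field names NAME and CAS_RN and the sample record are hypothetical, chosen only for illustration; real vendor files use their own field names, which is one source of the heterogeneity discussed above.

```python
from typing import Dict, Iterator, List

# Hypothetical one-record SDF file; the field names are illustrative only.
SAMPLE_SDF = "\n".join([
    "water",
    "  sketch",
    "",
    "  1  0  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0",
    "M  END",
    "> <NAME>",
    "water",
    "",
    "> <CAS_RN>",
    "7732-18-5",
    "",
    "$$$$",
])


def parse_record(lines: List[str]) -> Dict[str, str]:
    """Split one SDF record into its connection table and its data fields."""
    # The connection table (molfile) runs up to and including the "M  END" line.
    end = next((k for k, line in enumerate(lines) if line.startswith("M  END")),
               len(lines) - 1)
    record = {"molblock": "\n".join(lines[: end + 1])}
    i = end + 1
    while i < len(lines):
        header = lines[i]
        i += 1
        # A data item starts with a header such as "> <FIELD_NAME>".
        if header.startswith(">") and "<" in header:
            name = header[header.index("<") + 1 : header.rindex(">")]
            values = []
            # Value lines continue until the next blank line.
            while i < len(lines) and lines[i].strip():
                values.append(lines[i])
                i += 1
            record[name] = "\n".join(values)
    return record


def iter_sdf_records(text: str) -> Iterator[Dict[str, str]]:
    """Yield one dictionary per record; records are delimited by '$$$$' lines."""
    block: List[str] = []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            if any(item.strip() for item in block):
                yield parse_record(block)
            block = []
        else:
            block.append(line)
    # Tolerate a final record that lacks the closing delimiter.
    if any(item.strip() for item in block):
        yield parse_record(block)


for rec in iter_sdf_records(SAMPLE_SDF):
    print(rec["NAME"], rec["CAS_RN"])
```

A production workflow would more likely rely on an established cheminformatics toolkit; the sketch only shows why identifiers, properties and structures travel together in a single record.
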

While the commercial databases offer curated data there is certainly a price

barrier to accessing the information. A number of the free online resources are also

manually curated and, as will be discussed later, can offer as high a quality as the

commercial offerings. These resources are, however, constructed with a specific

focus in mind and therefore commonly number in the low thousands of structures

rather than the millions available in the larger online databases. Meanwhile, there

are a number of large online database resources offering access to valuable data and

knowledge. Some of these databases should be thought of as “linkbases”. For the

purpose of this article a linkbase is a repository of molecular connection tables

(chemical structures) linking out to various sources of data and associated

information. While it is impossible to be exhaustive within the confines of an article

of this nature an overview of a number of online public compound databases focusing

specifically on free access databases will be provided.

The confusion around the differences between Open Access (OA) versus Free

Access (FA) continues to persist [9] but both offer an opportunity to help advance

science by facilitating the sharing of data, information and knowledge with no

barriers of price or access. The first major international statement on open access

was the Budapest Open Access Initiative (BOAI), in February 2002 [10]. The

definition of Open Access is as follows: “By 'open access' to this literature, we mean

its free availability on the public internet, permitting any users to read, download,

copy, distribute, print, search, or link to the full texts of these articles, crawl them

for indexing, pass them as data to software, or use them for any other lawful

purpose, without financial, legal, or technical barriers other than those inseparable

from gaining access to the internet itself. The only constraint on reproduction and

distribution, and the only role for copyright in this domain, should be to give authors

control over the integrity of their work and the right to be properly acknowledged

and cited.” [11]. Free Access is not equivalent to Open Access but a simple definition

has been suggested [12]: “Free access is access that removes price barriers but not

necessarily any permission barriers.” For the purpose of this article we are not only

interested in FA and OA but also Open Data.

Quoting from an online resource [13] “Open Data is a philosophy and

practice requiring that certain data are freely available to everyone, without

restrictions from copyright, patents or other mechanisms of control”. As yet there

are no commonly agreed upon definitions but as a result of Open Data evangelists

and groups progress is being made [14•,15••,16-18].

The majority of scientists cannot however differentiate between free access

and open access since both provide free access to information of value to them in

their work. In a similar way, the majority of scientists do not care about the

distinctions between Open and Closed data. They utilize free access public chemical

compound databases on an as-needed basis, derive value from the content and

move on, not concerned whether the data posted online are Open or Closed.

Chemical Abstracts Service (CAS) [5] and their CAS Registry Numbers (RNs) [19]

have played a dominant role in managing a curated registry of chemical entities and

related chemical and biological literature. Their proprietary registration system does

not link to chemical structures in the public domain and their business model is at

risk [20••,21].

Before reviewing examples of public compound databases we should review

the issues of data quality. All content databases containing chemical compounds

contain errors. These errors can arise for a series of reasons including errors in

transcription, historical errors (a compound was “correct” when entered but later re-

characterized), issues with graphical representation and a plethora of other reasons.

The quality of chemical information in the public domain is generally quite low. This

does not mean that the data are not of value but that care needs to be taken in the

nature of the provider as an authority. There is, of course, no central body

responsible for the quality of data in the public domain. Databases of chemical

structure information such as PubChem [22••], ChemIDplus [23] and ChemFinder

[24] etc., are commonly looked upon as authorities in terms of reliable information.

However, these sources are also aggregators of information and are at risk of

perpetuating errors from the original public data and depositions. Errors in structure-

identifier pairs are common [25] and inaccurate structure representations,

specifically in regards to stereochemistry, proliferate across many databases. A

definitive description of the challenges regarding quality in public domain databases,

and the rigorous processes required to aggregate quality data was provided by

Richards et al [26••]. During their assembly of the EPA DSSTox databases they

assembled the chemical structures, chemical names and CAS Registry Numbers for

over 8000 chemicals from numerous toxicity databases. The data they extracted

were carefully curated and validated using multiple public information sources [27].
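
One inexpensive automated check that can precede manual curation of registry numbers is check-digit validation. The sketch below is not taken from the DSSTox procedure; it simply applies the published CAS check-digit rule, in which the final digit equals the weighted sum of the preceding digits (weights 1, 2, 3, ... counted from the right) modulo 10. The function name is hypothetical.

```python
import re

# CAS Registry Numbers have the form NNNNNNN-NN-C: two to seven digits,
# two digits, then a single check digit.
_CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_valid_cas_rn(value: str) -> bool:
    """Return True when the string is well formed and its check digit matches."""
    match = _CAS_PATTERN.match(value.strip())
    if match is None:
        return False
    digits = match.group(1) + match.group(2)
    check = int(match.group(3))
    # Weight the digits 1, 2, 3, ... starting from the rightmost one.
    total = sum(weight * int(digit)
                for weight, digit in enumerate(reversed(digits), start=1))
    return total % 10 == check


print(is_valid_cas_rn("7732-18-5"))
print(is_valid_cas_rn("7732-18-4"))
```

A passing check digit only shows that the number is internally consistent. It says nothing about whether the number is paired with the correct structure, which is the harder problem described above.
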

In regards to the quality of the chemical information presented with bioassay

data on PubChem Richards cautioned 'user beware' [26]. Since the chemical

structure content is deposited without additional review the user is at risk. Errors in

chemical names are common, and multiple structure errors have been identified.

Richards encourages users to make informed judgments on the quality of data based

on prior knowledge of the data submitter. The responsibility for the quality of the

PubChem database therefore rests with the depositors primarily and, as many of

these are commercial chemical vendors, their focus on quality is far less than the

stringent expectations of the community. The proliferation of errors from PubChem

into other databases has been identified [28] and a definitive effort to cleanse the

errors from the data, be it in regards to chemical structures, names or identifiers, is

going to be required. The efforts of groups such as the ChemSpider team with their

online curation [29] offer an opportunity to dramatically improve the quality of the

data through both a roboticized cleansing approach and manual examination by

many users. Efforts such as these should help reduce errors and result in the

proliferation of more validated information.
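
As an illustration of what an automated cleansing pass might look for, the sketch below flags identifiers that different depositors have paired with different structures. It is not the ChemSpider procedure. It assumes that each deposition has been reduced to a (source, identifier, structure key) triple and that structure keys (for example canonical SMILES or InChI strings) were normalized beforehand so that textual comparison is meaningful. The depositor names and the sample data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Deposition = Tuple[str, str, str]  # (source, identifier, structure key)


def find_identifier_conflicts(
    depositions: Iterable[Deposition],
) -> Dict[str, Dict[str, Set[str]]]:
    """Map each conflicting identifier to {structure key: set of sources}."""
    seen: Dict[str, Dict[str, Set[str]]] = defaultdict(lambda: defaultdict(set))
    for source, identifier, structure in depositions:
        seen[identifier.strip()][structure.strip()].add(source)
    # An identifier is suspicious when it points at more than one structure.
    return {
        identifier: {structure: set(sources) for structure, sources in by_structure.items()}
        for identifier, by_structure in seen.items()
        if len(by_structure) > 1
    }


# Hypothetical depositions: one vendor attaches the wrong structure to 64-17-5.
SAMPLE_DEPOSITIONS = [
    ("vendor_a", "64-17-5", "CCO"),
    ("vendor_b", "64-17-5", "CCO"),
    ("vendor_c", "64-17-5", "CO"),
    ("vendor_a", "7732-18-5", "O"),
]

for identifier, structures in find_identifier_conflicts(SAMPLE_DEPOSITIONS).items():
    print(identifier, sorted(structures))
```

Such a report cannot say which pairing is correct. It only narrows down the records that a human curator, or the depositing community, should examine.
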

Public Compound Databases

PubChem

The highest profile online database is certainly PubChem [22], launched by

NIH in 2004 to support the New Pathways to Discovery component of their roadmap

initiative [30]. PubChem archives and organizes information about the biological

activities of chemical compounds into a comprehensive biomedical database and is

the informatics backbone for the initiative, intended to empower the scientific

community to use small molecule chemical compounds in their research.

PubChem consists of three databases (PubChem Compound, PubChem

Substance, and PubChem BioAssay) connected together. PubChem Compound

contains 18 million unique structures and provides biological property information for

each compound. PubChem Substance contains records of substances from depositors

into the system. These are publishers, chemical vendors, commercial databases and

other sources. The PubChem Compound database contains records of individual

compounds (see Figure 1). PubChem BioAssay contains information about bioassays

using specific terms pertinent to the bioassay.

PubChem can be searched by alphanumeric text variables such as names of

chemicals, property ranges or by structure, substructure or structural similarity. As

of December 2007 its content is approaching 38.7 million substances and 18.4

million unique structures. Such a source of data opens up new possibilities [31] in

regards to data mining and extraction. Zhou et al [32•] concluded that the system

has an important role as a central repository for chemical vendors and content

providers enabling evaluation of commercial compound libraries and saving

biomedical researchers from the work associated with gathering and searching

commercial databases. They identified that over 35% of the 5 million structures from

chemical vendors or screening centers found in the PubChem database currently are

not present in the CAS registry.

PubChem continues to grow in stature, content and capability. The bioassay

data resulting from the NIH Roadmap initiative is likely to continue to grow and

PubChem will assume a prominent role in distributing the data in a standard format.

Despite the obvious value of PubChem the platform has caused quite a furor in

recent years including debates regarding the position of CAS relative to the resource.

The reader is referred elsewhere for commentaries [33,34]. Others have commented

on the quality of the data content within PubChem. Shoichet [35••] believes that the

screening data are less rigorous than those in peer-reviewed articles, and contain

many false positives. Shoichet worries that chemists who use PubChem may be sent

on a wild goose chase. Numerous problems arise from the quality of submissions

from various data sources and there are thousands of errors in the structure-

identifier associations due to this contamination and this can lead to the retrieval of

incorrect chemical structures. It is also common to have multiple representations of

a single structure due to incomplete or total lack of stereochemistry for a molecule

[36].
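
The effect of missing stereochemistry can be sketched with InChI strings, in which stereochemical information is carried in separate layers. The example below assumes that records are available as standard InChI strings and treats any layer beginning with b, t, m or s as stereochemical. Removing those layers lets fully specified, partially specified and flat depictions of the same skeleton be grouped together. This is a simplification (isotopic and fixed-hydrogen layers are left untouched), and the record names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Layer prefixes that carry stereochemistry in an InChI string:
# /b (double bond), /t (tetrahedral), /m and /s (stereo type flags).
_STEREO_PREFIXES = ("b", "t", "m", "s")


def strip_stereo_layers(inchi: str) -> str:
    """Return the InChI with its stereochemical layers removed."""
    head, *layers = inchi.split("/")
    kept = [layer for layer in layers if not layer.startswith(_STEREO_PREFIXES)]
    return "/".join([head, *kept])


def group_by_skeleton(records: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (record name, InChI) pairs that differ only in stereochemistry."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for name, inchi in records:
        groups[strip_stereo_layers(inchi)].append(name)
    return dict(groups)


# Three depictions of alanine: both enantiomers and a flat structure.
SAMPLE_RECORDS = [
    ("record_L", "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"),
    ("record_D", "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m1/s1"),
    ("record_flat", "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)"),
]

for skeleton, names in group_by_skeleton(SAMPLE_RECORDS).items():
    print(skeleton, names)
```

Grouping of this kind does not decide which depiction is correct. It only makes the multiple representations visible so that they can be reviewed together.
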

DSSTox

The EPA Distributed Structure-Searchable Toxicity (DSSTox) database project

[38,39] provides a series of documented, standardized and fully structure-annotated

files of toxicity information [40]. The initial intention for the project was to deliver a

public central repository of toxicity information to allow for flexible analogue

searching, SAR model development and the building of chemical relational

databases. In order to ensure maximum uptake by the public and allow users to

integrate the data into their own systems the DSSTox project adopted the use of the

common standard file format (SDF) to include chemical structure, text and property

information. The DSSTox databases were also deployed online to provide free public

access to the data files without the dependency on a desktop software package for

querying and managing the data files. The overall aims of the project, to deeply

integrate chemical structure information with existing toxicity data and to facilitate

interrogation of the data have been achieved. The DSSTox datasets are among the

most highly curated public datasets available and likely the reference standard in

publicly available structure-based toxicity data.

eMolecules

eMolecules [41] offers a free online database of almost 8 million unique

chemical structures. The database is assembled from data supplied by over 150

suppliers and provides a path to identifying a vendor for a particular chemical

compound. By providing access to compounds for purchase they are providing a free

access online service similar to those of commercial databases such as Symyx

Available Chemicals Directory [42], CAS’ ChemCats [43] and CambridgeSoft’s

ChemACX [44] as well as a number of other providers. The system offers access to

more than 4 million commercially available screening compounds and many tens of

thousands of building blocks and intermediates. Their database was recently

enhanced by providing access to NMR, MS and IR spectra from Wiley-VCH [45] for

over 500,000 compounds via ChemGate [45], a fee-based service. eMolecules also

provides links to many sources of data for spectra, physical properties and biological

data including the NIST WebBook [46], the National Cancer Institute [47],

DrugBank [48•] and PubChem.

eMolecules is presently fairly limited in its scope and primarily offers a very

useful path to the purchase of chemicals and links to the more popular government

databases. Nevertheless, the site is popular with chemists who are searching for

chemicals and the interface is intuitive and easy to use, a key element in attracting

users.

DrugBank

DrugBank [48•] is a manually curated resource assembled from information

collected from a series of other public domain databases and enhanced with

additional data generated within the laboratories of the hosts. The database

aggregates both bioinformatics and cheminformatics data and combines detailed

drug data with comprehensive drug target (i.e. protein) information. The database is

hosted by the University of Alberta, Canada. Version 1 of the database, released in

2006, contained >4100 drug entries including >800 FDA approved small molecule

and biotech drugs as well as >3200 experimental drugs. Over 14,000 protein or drug

target sequences were linked to these drug entries. Each record in the database,

known as a DrugCard, has >80 data fields. The information is split into

drug/chemical data and drug target or protein data and many data fields are linked

to other databases (KEGG [49], PubChem, ChEBI [50], PDB [2] and others). The

database supports extensive text, sequence, chemical structure and relational query

searches.

DrugBank has been used to facilitate in silico drug target discovery, drug

design, drug docking or screening, drug metabolism prediction, drug interaction

prediction and general pharmaceutical education. Version 2.0 of DrugBank [51••]

was released in January of this year with over 800 new drug entries, and each

DrugCard entry was extended to include over 100 data fields, with half of the

information being devoted to drug/chemical data and the other half devoted to

pharmacological, pharmacogenomic and molecular biological data. They have started

to add experimental spectral data (NMR and MS specifically), and have expanded the

coverage to nutraceuticals and herbal medicines.

The DrugBank team also hosts the Human Metabolome Database (HMDB)

[52], a database containing information about small molecule metabolites found in

the human body. The database is used by scientists working in the areas of

metabolomics, clinical chemistry and biomarker discovery. The database currently

contains nearly 3000 metabolite entries and each MetaboCard entry contains more

than 90 data fields devoted to chemical, clinical, enzymatic and biochemical

data.

NMRShiftDB

The NMRShiftDB is an open source collection of chemical structures and their

associated NMR shift assignments [53•,54]. The database is generated as a result of

contributions by the public and currently contains over 20,000 structures with

>220,000 assigned carbon chemical shifts. Datasets entered by contributors are sent

to registered reviewers for evaluation. A significant part of NMRShiftDB was initially

assembled from in-house databases from collaborating institutions and was entered

unchecked. This called for external checks of the data based on independent

databases and resources and these have now been carried out by two specific groups

[56,57]. Williams et al. [56] performed a cursory examination of the structural

diversity within the database and concluded that the data represented a statistically

relevant set to use in an evaluation of predictive accuracy and demonstrated that the

quality of the data is rather impressive. This effort shows the advantages of

providing a set of Open Data for reuse and examination and the benefits of having

many scientists examine, validate and correct. The benefit is possible for any

database allowing its users to qualify, annotate and correct its data.

ChemSpider

ChemSpider was released to the public in March 2007 with the intention of

“building a structure centric community for chemists”. ChemSpider has grown into a

resource containing almost 18 million unique chemical structures and recently shared

its data with PubChem providing about 7 million unique compounds. The data

sources have been gathered from chemical vendors as well as commercial database

vendors and publishers and members of the Open Notebook Science community.

ChemSpider has also integrated the SureChem patent database [59] collection of

structures to facilitate links [60] between the systems. The database can be queried

using structure/substructure searching and alphanumeric text searching of both

intrinsic as well as predicted molecular properties. They have recently added virtual

screening results using the LASSO similarity search tool [61] to screen the

ChemSpider database against all 40 target families from the Database of Useful

Decoys (DUD) dataset.

ChemSpider has enabled unique capabilities relative to the primary public

chemistry databases. These include real time curation of the data, association of

analytical data with chemical structures, real-time deposition of single or batch

chemical structures (including with activity data) and transaction-based predictions

of physicochemical data. The ChemSpider developers have made available a series of

web services to allow integration to the system for the purpose of searching the

system as well as generation of InChI identifiers and conversion routines.

The system also integrates text-based searching of Open Access articles and

presently searches over 50,000 OA Chemistry articles, soon to be extended to 150,000

articles. The index is expected to increase dramatically as they extract chemical

names from OA articles and convert the names to chemical structures using name to

structure conversion algorithms. These chemical structures will be deposited back to

the ChemSpider database thereby facilitating structure and substructure searching in

concert with text-based searching.

ChemSpider has a focus on, and commitment to, community curation. The

social community aspects of the system demonstrate the potential of this approach.

The team have committed to the release of a wiki-like environment for further

annotation of the chemical structures in the database, a project they term

WiChempedia. They will utilize both available Wikipedia content and deposited

content from users to enable the ongoing development of community curated

chemistry.

Other Databases

The list of databases and resources reviewed above is only representative of

the type of information available online. Other highly regarded databases frequented

by this author include the Chemical Structure Lookup Service (with over 36 million

unique structures) [64], CrystalEye [65], KEGG [49] and ChEBI [50]. There are also

many other resources available and the reader is referred to one of the many

indexes of such databases available on the internet to identify potential resources of

interest [4,66].

Public Compound Databases versus Commercial Databases

The creation, hosting and support of a curated chemical compound database

with integrated content is an expensive enterprise. Historically these databases have

been built as a result of hundreds if not thousands of man years of rigorous and

exacting human effort and then, for some of the original founders in this domain,

migrated onto computer systems. In the development of these systems host

organizations have created sizeable revenues and estimated annual fees for

accessing this information via just a few organizations likely exceed half a billion

dollars. With the advances in technology accompanying the internet boom, the

hosting of large databases, the text-based searching of immense amounts of data

and the ability to disseminate complex forms of graphical information via standard

protocols created an opportunity for disruptive offerings in this domain.

They soon arrived.

The primary advantage of commercial databases is that they have been

manually examined by skilled curators, addressing the tedious task of quality data

checking. Certainly the aggregation of data from multiple sources, both historical and

modern, from multiple countries and languages and from sources not available

electronically is a significant enhancement over what is available via an internet

search. The question remains: how long will this be the case? Scientists working

in new areas of science and domains of expertise reflect on the most recent

literature in general. Can you imagine a search about the semantic web being

conducted just a few years ago? What about metabonomics or even genomics?

Certain areas of the scientific literature, while still of high value, can become

antiquated fairly quickly. With the new capabilities of internet-based searching and

direct access to abstracts for the majority of publishers even a rudimentary text

search can expose articles previously unavailable except through an abstracting

service. Search engines will increasingly be utilized for first level searches specifically

because they are simple to use, they are fast and they are free. With chemically

searchable patents also available online [59,67], at no charge, the landscape for

scientists searching for information is more open than ever. If there are data of

interest to be located then internet search engines will enable their discovery.

The premier curated database offerings of today have an interesting if not

challenging future ahead of them. Their value-added enhancements of the

distributed data must be significant enough to warrant an investment in their

services [68]. As expressed earlier the quality of the data resulting from curation is

significant but this author questions the longevity of that distinguishing factor

moving forward. Roboticized recognition and conversion of chemical names to

chemical structures can dramatically shift this domain and efforts have already been

demonstrated in applications to patents and publications. Should the quality reach a

sufficient standard then today’s publishers’ business models will definitely be at risk.

The Future of Public Compound Databases

The semantic web [69] is already offering us the chance to connect,

interrogate and mash up the results of searching multiple public

compound databases simultaneously. An enormous diversity of data is already

available for interrogation by the public and continues to expand daily. This author

remains concerned with the very real quality issues associated with public data sets.

While the utopian dream of no errors in freely available data cannot be met the push

towards more Open Data without consideration being given to both manual and

robotic curation could be risky to those using the data. Real-time curation of data

within public compound databases is feasible [29] and certainly Wikipedia is a model

of crowd sourcing [71] to build, curate and maintain a quality database.

Unfortunately, even these world-renowned platforms actually sit on the shoulders of

a very few dedicated individuals, relative to the users, who care about quality. There

is no simple solution to the issues of quality and they will persist for the foreseeable

future until processes, procedures and momentum to resolve the issues are

established.

Even in its earliest form PubChem has been referred to, tongue-in-cheek, as

“the granddaddy of all free chemistry databases”. Certainly it presently holds the

premier position in reputation, capabilities and connectivities built on a database of

chemical structures and linked out to biological assay data, the PubMed database

and an array of services to facilitate both the distribution of the data and the wealth

of tools developed to support the system. The majority of databases discussed in this

article now use two primary identifiers in their systems – the CAS registry number

and a PubChem ID number. This alone indicates a shift in equality of commercial

versus public compound repositories. For now, PubChem remains focused on its

initial intent to support the National Molecular Libraries Initiative. The data within

PubChem have never formally been declared as Open Data but are assumed to be

available in that manner and thereby offer to scientists a valuable aggregate of data

for the purpose of data mining and discovery.

At the time of writing the newest addition to the proliferating domain of public

chemical compound databases is the ChemSpider Database [57], working to “Build a

Structure Centric Community for Chemists”. This system presently offers a series of

unique capabilities which might become trend-setting for present and future

databases. As discussed earlier these include the user deposition of structures, real-

time annotation and curation of data, management of analytical data and online

transaction services. It is this author’s belief that such capabilities will likely become

standard for the majority of public chemical compound databases in the near

future. These types of capabilities could help establish the newfound shift to Open

Notebook Science and move the focus beyond the chemical biology databases (PubChem,

DrugBank, HMDB and DSSTox), providing an environment in which non-life science

chemists, polymer chemists and material scientists can manage and research

information of interest to them.

The WikiSphere, Blogosphere and Internet as a Public Compound Database.

Wikis and blogs are common terms now for the majority of users of the

worldwide web and both are fast becoming chosen platforms for the exchange of

information between many scientists, not only as tools within their own research

groups but also, more broadly, with the public. A blog, or weblog, is a website

where entries are written in chronological order and generally provide commentary

or news on a particular subject [71]. A typical blog combines text, images and links

to other blogs, web pages, and other media related to its topic. The original blog

posting remains untouched by the commenter and readers are free to add their

comments, generally in a mediated manner where the blog host retains control over

the postings. An example screenshot from a chemistry-based blog hosted with the

intention of examining and discussing organic syntheses is shown in Figure 3. The

number of chemistry-related blogs continues to grow dramatically and there have

been efforts to provide a unified view into some of these [72,73].

A wiki is a type of computer software that allows users to easily create, edit

and link web pages and enables documents to be written collaboratively, in a simple

markup language using a web browser; it is essentially a database for creating,

browsing and searching information [74]. Wikipedia is certainly the best known wiki

today, though there are many others already online or used within the confines of

an organization to manage content. There are active groups supporting the

development of chemistry on Wikipedia and there are now thousands of pages

describing small organic molecules, inorganics, organometallics, polymers and even

large biomolecules. Focusing on small molecules in general, each one has a Drug Box

[75] or a Chemical infobox [76]. A drug box provides identifier information

(chemical name, registry number, and so on) and commonly the identifiers link out

to a related resource. Chemical data, pharmacokinetic data and therapeutic

considerations can also be listed. At present there are approximately 8000 articles

with a chembox or drugbox [3], with between 500 and 1000 articles added since May.

The detailed information offered on Wikipedia regarding a particular chemical or drug

can be excellent ([77], see Figure 2) or weak [78]. There are many dedicated

supporters and contributors to the quality of the online resource. Drug and

chemboxes have been shown to contain errors but the advantage of a wiki is that

changes can be made within a few keystrokes and the quality is immediately

enhanced. The opposite is also true and vandalism can occur. This community

curation process makes Wikipedia a very important online chemistry resource whose

impact will only expand with time.

Wikis have recently been used as the basis of Open Notebook Science [79].

The UsefulChem Wiki [80] includes a series of experimental pages commonly linked

to related blog pages as shown in Figure 4. The Open Notebook Science

movement appears to be gaining momentum with the support of vocal

advocates, such as Neylon [81], Murray-Rust [82] and many others.

While both wikis and blogs are very valuable for information exchange, the

text and image exchange they enable falls well short of what many chemists need

when searching, namely the ability to query by chemical structures,

reactions and data. Neither wikis nor blogs are, as yet, enabled for

structure and substructure searching and they therefore remain isolated, in general,

from cheminformatics-based search procedures. One of the key developments which

has already facilitated the Semantic Web for chemistry is the InChI [83], the

International Chemical Identifier. The InChI string is a textual identifier for chemical

substances designed to provide a standard and human-readable way to encode

molecular information (see Figure 5) and to facilitate the search for such information

in databases and on the web. The InChI string, unfortunately, has only partly delivered on

the promise of facilitating web-based searches, due to unpredictable breaking of InChI

character strings by search engines. In order to resolve this issue the InChIKey was

introduced. The condensed, 25-character InChIKey is a hashed version of the full InChI

and is not human-readable. The equivalent InChIKey for the InChI string of L-ascorbic

acid is CIWBSHSKHKDKBQ-JLAZNSOCBT. The advantage of the key is that it

enables web searches, but a lookup table to identify the associated structure, or a reference

to the original InChI string, is necessary [85]. While tens of millions of InChI strings

and keys have been populated into databases, their use is still in its infancy.

Publishers have started to embed InChIs into their articles and the Royal Society of

Chemistry [85] is presently pioneering a new publishing model, Project Prospect [86],

which includes InChIs to demonstrate movement toward the semantic web for chemistry.

Bloggers have started to use InChI strings and keys in their postings, and wiki

pages are being InChI-enabled to help the web become structure searchable. A

central lookup facility for published InChI strings will be necessary in

order to facilitate substructure searching of the web but this capability is likely to be

developed in the near future. Willighagen already aggregates InChI Strings onto a

blog [87].
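
The contrast between the two identifiers can be illustrated with a minimal Python sketch. It uses only the L-ascorbic acid values quoted in this article (Figure 5). The tokenizer is an assumption standing in for a generic text indexer, not the behaviour of any particular search engine, and the lookup table and the names naive_tokens, KEY_TO_INCHI and resolve are hypothetical.

```python
import re

# Values for L-ascorbic acid exactly as quoted in this article (Figure 5).
INCHI = "InChI=1/C6H8O6/c7-1-2(8)5-3(9)4(10)6(11)12-5/h2,5,7-10H,1H2/t2-,5+/m0/s1"
INCHIKEY = "CIWBSHSKHKDKBQ-JLAZNSOCBT"


def naive_tokens(text):
    """Split text wherever a character is not a letter or digit.

    Assumption: this stands in for a generic text indexer; it is not the
    tokenizer of any particular search engine.
    """
    return [token for token in re.split(r"[^A-Za-z0-9]+", text) if token]


# Hypothetical lookup table mapping an InChIKey back to its full InChI string.
KEY_TO_INCHI = {INCHIKEY: INCHI}


def resolve(key):
    """Return the InChI string registered for a key, or None if unknown."""
    return KEY_TO_INCHI.get(key)


if __name__ == "__main__":
    print(len(naive_tokens(INCHI)), "fragments from the InChI string")
    print(naive_tokens(INCHIKEY))
    print(resolve(INCHIKEY) == INCHI)
```

Under this naive splitting the InChI string falls apart into many short fragments, whereas the key yields only two long blocks of letters. Because the key is a hash, the structure can be recovered only through a table such as KEY_TO_INCHI, which is why a central lookup facility is needed.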

BioSpider [88] allows users to type in almost any kind of biological or

chemical identifier (protein/gene name, sequence, accession number, chemical

name, brand name, SMILES string, InChI string, CAS number, etc.) and delivers a

report about the biomolecule. BioSpider uses a web-crawler to scan through dozens

of public databases and employs a variety of specially developed text mining tools

and locally developed prediction tools to find, extract and assemble data for its

reports. A summary includes physico-chemical parameters, images, models, data

files, descriptions and predictions concerning the query molecule.
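
Accepting many identifier types through a single query box implies some means of telling them apart. The following Python sketch shows one way this might be done for three of the types listed above. It is not a description of how BioSpider itself works; the function names are hypothetical. The InChIKey pattern assumes the 25-character form quoted in this article, and the CAS test uses the registry number check digit (each digit before the check digit is weighted by its position counted from the right, and the sum is taken modulo 10).

```python
import re


def cas_checksum_ok(cas):
    """Validate a CAS registry number of the form NNNNNNN-NN-N."""
    match = re.fullmatch(r"(\d{2,7})-(\d{2})-(\d)", cas)
    if not match:
        return False
    digits = match.group(1) + match.group(2)
    # Weight each digit by its position counted from the right, starting at 1.
    total = sum(int(d) * w for w, d in enumerate(reversed(digits), start=1))
    return total % 10 == int(match.group(3))


def guess_identifier_type(query):
    """Make a best-effort guess at the kind of identifier in a query string."""
    q = query.strip()
    if q.startswith("InChI="):
        return "InChI"
    # Assumption: the 25-character key format used in this article.
    if re.fullmatch(r"[A-Z]{14}-[A-Z]{10}", q):
        return "InChIKey"
    if cas_checksum_ok(q):
        return "CAS"
    return "name or other"


if __name__ == "__main__":
    for q in ["50-81-7", "CIWBSHSKHKDKBQ-JLAZNSOCBT", "ascorbic acid"]:
        print(q, "->", guess_identifier_type(q))
```

Names, sequences and SMILES strings are deliberately left in a catch-all category here, since distinguishing them reliably requires considerably more than pattern matching.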

An increasing number of public databases will continue to become available

but the challenge, even now, is how to integrate and access the data. The

implementation of InChIs for web-based searching [89], and the delivery of

userscripts to aggregate information and computational results from different web

resources [90] are bringing together internet resources to appear as a single

monolithic public chemistry database. Willighagen et al. [90] use userscripts to

enrich biology and chemistry related web resources by incorporating or linking to

other computational or data sources on the web. They showed how information from

web pages can be used to link to, search, and process information in other resources

thereby allowing scientists to select and incorporate the appropriate web resources

to enhance their productivity. Tools such as these, connecting open chemistry databases and

user web pages, are an ideal path to more highly integrated information sharing.
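
The linking step that such userscripts perform can be sketched as follows. Userscripts themselves run as JavaScript inside the browser; the same idea is expressed here in Python for consistency with the other examples in this article. The lookup URL is a placeholder, not a real service, and the key pattern again assumes the 25-character InChIKey form quoted earlier.

```python
import re

# Assumption: keys follow the 25-character format quoted in this article.
KEY_PATTERN = re.compile(r"\b[A-Z]{14}-[A-Z]{10}\b")

# Hypothetical link-out template; example.org is a placeholder, not a real service.
LOOKUP_URL = "https://example.org/lookup?inchikey={key}"


def link_out(page_text):
    """Map each distinct InChIKey found in page_text to a link-out URL."""
    keys = []
    for key in KEY_PATTERN.findall(page_text):
        if key not in keys:  # keep first-seen order, drop repeats
            keys.append(key)
    return {key: LOOKUP_URL.format(key=key) for key in keys}


if __name__ == "__main__":
    text = "Vitamin C (CIWBSHSKHKDKBQ-JLAZNSOCBT) was used as the starting material."
    for key, url in link_out(text).items():
        print(key, "->", url)
```

A userscript would go one step further and rewrite the page so that each key found becomes a clickable link to the other resource.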

Conclusion

There is little doubt that the newfound availability of public chemical

compound databases with their associated chemistry and biological data is enabling

scientists to access information at less cost in both time and currency. The increasing

quantity of freely accessible and integrated data can speed decision making and

bring clarity or alternatively inundate and saturate the user with poor quality

information. Scientists now have free access to structure-searchable patents, open

and free access peer-reviewed publications and software tools for the manipulation

of chemistry-related data. Members of the Open Source movement are developing

toolkits, including visualization and data-mining tools, that, when coupled with the

public chemistry databases reviewed here, will likely benefit the process of discovery.

There are likely to be challenging times ahead in reconciling the needs of

commercial database publishers with the proliferation of free databases but this

journey will not be halted by the objections of the commercial entities provided that

legal copyrights are respected and the shift towards a more open community for

science persists.

Acknowledgements

The author wishes to thank the following people: Stephen Bryant and Evan Bolton

from the PubChem team, the IUPAC/National Institute of Standards and Technology

InChI team (Alan McNaught, Stephen Stein, Stephen Heller, Dmitrii Tchekhovskoi);

David Wishart and Nelson Young (Drugbank and HMDB), Nicko Goncharoff

(SureChem), Stephen Boyer (IBM), Marc Nicklaus (Chemical Structure Lookup

Service), members of the ChemSpider Advisory Group (Egon Willighagen, Sean

Ekins, Joerg Wegner and Alex Tropsha specifically), Ann Richard and Marti Wolf

(DSSTox), Christoph Steinbeck (NMRShiftDB), Nick Day and Peter Murray-Rust

(CrystalEye), Martin Walker, Andrew Yeung and Dirk Beestra (Wikipedia Chemistry).

I would also like to acknowledge the many contributors to the blogging discussions

about Open and Free Access.

References

1. Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL: GenBank. Nucleic

Acids Res. (2007) 35(Database issue):D21-5.

2. Berman HM, Henrick K, Nakamura H: Announcing the worldwide Protein Data

Bank. Nature Structural Biology (2003) 12: 980

3. Murray-Rust P: Chemistry for everyone. Nature (2008) 451:648-651.

•Provides a vision for the future of data distribution, access and integration across

the worldwide web and espouses the need for Open Data policies and adoption of the

Semantic Web.

4. Gary Wiggins’ Wiki. CHEMBIOGRID, Chemistry Databases on the Web:

Alphabetical: http://cheminfo.informatics.indiana.edu/cicc/cis/index.php/Chemistry_Databases_on_the_Web_%28Alphabetical_List%29

Classified: http://cheminfo.informatics.indiana.edu/cicc/cis/index.php/Chemistry_Databases_on_the_Web_%28Classified_List%29

•An aggregation of chemistry databases, curated and annotated, to provide

significantly more information than would be returned in a generic search of the

internet.

5. Symyx: CTFile formats no-fee. (2008)

http://www.mdli.com/downloads/public/ctfile/ctfile.jsp

6. CAS: Chemical Abstracts Service, Columbus, OH, USA (2006).

http://www.cas.org/

7. InfoChem: InfoChem Gesellschaft für Chemische Information, München,

Germany (2008). http://infochem.de/

8. Symyx: Santa Clara, California, USA (2008). http://www.symyx.com/

9. The University’s Mandate To Mandate Open Access: Harnad S, (2008)

http://openaccess.eprints.org/index.php?/archives/358-The-Universitys-Mandate-To-Mandate-Open-Access.html

10. Open Access: Wikipedia Article on Open Access. (2008)

http://en.wikipedia.org/wiki/Open_access

11. The BOAI FAQ page: Frequently Asked Questions about the Budapest Open

Access Initiative (2008), http://www.earlham.edu/~peters/fos/boaifaq.htm

12. Williams AJ: A perspective of Publicly Accessible/Open Access Chemistry

Databases: Drug Discovery News (2008), accepted for publication

13. Open Data: Wikipedia Article on Open Data. (2008)

http://en.wikipedia.org/wiki/Open_data

14. Murray-Rust P, Rzepa HS, Tyrrell SM and Zhang Y: Representation and use of

Chemistry in the Global Electronic Age ChemInform, 36(15), (2005)

• An excellent outline regarding the potential of combining open access and the

semantic web in chemistry. Rzepa and Murray-Rust are two of the evangelists of this

domain and outline in this article how data may be interconnected to the benefit of

all chemists.

15. Guha R, Howard MT, Hutchison GR, Murray-Rust P, Rzepa H, Steinbeck C,

Wegner J , Willighagen EL: The Blue Obelisk-Interoperability in Chemical

Informatics, J Chem Inf Model, (2006) 46 (3), 991-998.

••The Blue Obelisk Movement (http://www.blueobelisk.org/) is the name used by a

group of scientists and developers supporting open source software development,

consistent and complementary chemoinformatics research, open data, and open

standards in Chemistry.

16. CODATA, The Committee on Data for Science and Technology: CODATA,

Paris, France (2008). http://www.codata.org/

17. An Introduction to Science Commons: Wilbanks J, Boyle J, (2006).

http://sciencecommons.org/wp-content/uploads/ScienceCommons_Concept_Paper.pdf

18. The Open Knowledge Foundation: Protecting and Promoting Open Knowledge

in a Digital Age (2008). http://www.okfn.org/

19. CAS Registry Numbers: Chemical Abstracts Service, Columbus, OH, USA

(2008). http://www.cas.org/expertise/cascontent/registry/regsys.html

20. Murray-Rust P, Mitchell JB, Rzepa HS: Communication and re-use of

chemical information in bioscience. BMC Bioinform (2005) 6:180-196.

•• Provides an overview of chemical information on the Internet and, while slightly

outdated, is an important read in regards to the challenges and the vision of a

Semantic Web for Chemistry.

21. Heller SR, Stein SE, Tchekhovskoi DV: Open source/open access/open data

and the IUPAC International Chemical Identifier - InChI. American Chemical

Society National Meeting, Washington, DC, USA (2005):CINF-60.

22. NCBI: PubChem: National Center for Biotechnology Information, Bethesda, MD,

USA (2008). http://pubchem.ncbi.nlm.nih.gov

•• PubChem is a large data aggregator (nearing 20 million structures) and offers

relational searching capabilities via text, structure and substructure searching and

access to the entire dataset via download of SDF files. A series of services for the

handling of chemistry databases are also available via the website.

23. ChemIDplus: National Library of Medicine, Bethesda, MD, USA (2008).

http://chem.sis.nlm.nih.gov/chemidplus/chemidheavy.jsp

24. ChemFinder.com: CambridgeSoft Corp, Cambridge, MA, USA (2008).

http://chemfinder.cambridgesoft.com/

25. Hacking Pubchem - Technology easy, Quality difficult: Williams AJ, (2007)

http://www.chemspider.com/blog/hacking-pubchem-technology-easy-quality-difficult.html.

26. Richard AM, Swirsky Gold L, Nicklaus MC: Chemical structure indexing of

toxicity data on the Internet: Moving toward a flat world. Current Opinion in

Drug Discovery & Development (2006) 9(3): 314-325.

•• The review discusses efforts to gather, curate and make publicly available

toxicology-related chemical information. The specific discussions regarding the

quality issues with public chemistry databases and efforts to produce clean quality

databases are noteworthy.

27. DSSTox Quality Chemical Information Review Procedures: US

Environmental Protection Agency, Washington, DC, USA (2008).

http://www.epa.gov/nheerl/dsstox/ChemicalInfQAProcedures.html

28. PubChem Errors: Williams AJ, PubChem Meeting, Washington DC: (2007)

http://www.chemspider.com/docs/PubChem_at_ChemSpider_Overview_SLides_September_2007.pdf

29. The Process of Curating Identifiers on ChemSpider: Williams AJ, (2008)

http://www.chemspider.com/docs/The_Process_of_Curating_Identifiers_on_ChemSpider.pdf

30. The NIH Roadmap Initiative: Office of Portfolio Analysis and Strategic

Initiatives, National Institutes of Health, Bethesda, Maryland 20892: (2008)

http://nihroadmap.nih.gov/

31. Hacking PubChem: Why The Open Access Fight is Just the Beginning,

Apodaca R, (2006),

http://depth-first.com/articles/2006/09/22/hacking-pubchem-why-the-open-access-fight-is-just-the-beginning

32. Zhou Y, Chen K, Yan SF, King FJ, Jiang S, Winzeler EA: Large-Scale

Annotation of Small-Molecule Libraries Using Public Databases. J. Chem. Inf.

Model. (2007) 47:1386-1394

•• The 2.5 million compound collection at the Genomics Institute of the Novartis

Research Foundation (GNF) was used as a model to determine whether automated

annotation of screening hits in batch is feasible.

33. The American Chemical Society and NIH’s PubChem, Reshaping Scholarly

Communication Blog: (2008)

http://osc.universityofcalifornia.edu/news/acs_pubchem.html

34. Background of the PubChem/CAS Issue: (2008)

http://www.arl.org/bm~doc/backgroundfaqpb.pdf

35. Baker M: Open-access chemistry databases evolving slowly but not

surely:Nature Reviews, Drug Discovery, (2006) 5:707-708

• A critical review of how far publicly available initiatives have to go to catch up with

commercial offerings.

36. How big is the challenge of curation and what is the structure of

Ginkgolide-B: Antony Williams (2008),

http://www.chemspider.com/blog/how-big-is-the-challenge-of-curation-and-what-is-the-structure-of-ginkgolide-b.html

37. DSSTox: Distributed Structure-Searchable Toxicity (DSSTox) Database:

US Environmental Protection Agency, Washington, DC, USA (2006).

http://www.epa.gov/nheerl/dsstox/

38. Richard AM and Williams CR (2002) Distributed Structure-Searchable

Toxicity (DSSTox) Public Database Network: A Proposal, Mutation Research:

New Frontiers, 499:27-52.

39. Richard AM: DSSTox web site launch: Improving public access to

databases for building structure-toxicity prediction models, Preclinica, (2006)

2(2):103-108.

40. DSSTox Data Files: http://www.epa.gov/ncct/dsstox/DataFiles.html

41. eMolecules Online Service: eMolecules, Del Mar, CA, USA (2008).

http://www.emolecules.com

42. Available Chemical Directory: Santa Clara, California, USA (2008).

http://www.mdli.com/products/experiment/available_chem_dir/index.jsp

43. ChemCats: Chemical Abstracts Service, Columbus, OH, USA (2006).

http://www.cas.org/expertise/cascontent/chemcats.html

44. ChemACX: CambridgeSoft Corp, Cambridge, MA, USA (2008).

http://www.cambridgesoft.com/databases/details/?db=12

45. ChemGate: Tony Davies, eMolecules and Spectroscopy: Spectroscopy Europe,

(2007) 19(1):27-28

46. The NIST Chemistry WebBook: (2008) http://webbook.nist.gov/chemistry/

47. NCI/NIH Developmental Therapeutics Program: National Cancer Institute,

Frederick/National Institutes of Health, Bethesda, MD, USA. (2008).

http://dtp.nci.nih.gov/index.html

48. Wishart DS, Knox C, Guo AC, Shrivastava S, Hassanali M, Stothard P, Chang Z,

Woolsey J: DrugBank: a comprehensive resource for in silico drug discovery

and exploration, Nucleic Acids Res. (2006) 34:D668-72

• A detailed description of the intent, development and capabilities of the Drugbank

database, one of the most respected public chemistry databases utilized by drug

discovery scientists today.

49. Kanehisa M, Goto S, Kawashima S, Okuno Y, Hattori M: KEGG: The KEGG

resource for deciphering the genome, Nucleic Acids Res. (2004) 32 (Database

issue):D277-80

50. Degtyarenko K, de Matos P, Ennis M, Hastings J, Zbinden M, McNaught A,

Alcántara R, Darsow M, Guedj M, Ashburner M: ChEBI: a database and ontology

for chemical entities of biological interest, Nucl. Acids Res. (2008) 36: D344-

D350;

51. Wishart DS, Knox C, Guo AC, Cheng D, Shrivastava S, Tzur D, Gautam B,

Hassanali M: DrugBank: a knowledgebase for drugs, drug actions and drug

targets, Nucleic Acids Res. (2008) 36(Database issue):D901-6.

•• An update regarding the DrugBank database as it is released in its Version 2

state.

52. HMDB: The Human Metabolome Database. Nucleic Acids Res. (2007) 35:

D521-6

53. Steinbeck C, Krause S, Kuhn S: NMRShiftDB– Constructing A Chemical

Information System With Open Source Components. J. Chem. Inf. Comput. Sci.

(2003) 43:1733-1739.

•The defining article regarding the development of the NMRShiftDB database defining

the intention of the work, the development of the software components and a vision

of how such a platform can lead to widespread dissemination of analytical data, at

no-charge, to the chemistry community.

54. Steinbeck C, Kuhn S. NMRShiftDB – Compound Identification And

Structure Elucidation Support Through a Free Community-Built Web

Database. Phytochemistry, (2004), 65:2711–2717.

55. Blinov KA, Smurnyy YD, Elyashberg ME, Churanova TS, Kvasha M, Steinbeck C,

Lefebvre BA, Williams AJ: Performance Validation of Neural Network Based

13C NMR Prediction Using a Publicly Available Data Source. J Chem Inf

Model, (2008), Accepted for publication, doi: 10.1021/ci700363r.

56. CSEARCH and NMRShiftDB: Robien W (2007)

http://nmrpredict.orc.univie.ac.at/csearchlite/enjoy_its_free.html

57. Williams AJ, ChemSpider and Its Expanding Web: Building a Structure-

Centric Community for Chemists, Chemistry International (2007) 30(1): 30.

58. Open Notebook Science: Bradley JC, (2006) Drexel CoAs E-Learning Blog,

http://drexel-coas-elearning.blogspot.com/2006/09/open-notebook-science.html

59. SureChem: San Francisco, CA, USA (2008) http://www.surechem.org/

60. Free Access Structure Searching of Patents: Williams AJ (2007),

http://www.chemspider.com/docs/Structure_Searching_of_Patents_Using_ChemSpider.pdf

61. LASSO: Ligand Activity in Surface Similarity Order, SimBioSys Inc., Toronto,

Canada. http://www.simbiosys.ca/ehits_lasso/index.html

62. Database of Useful Decoys: http://dud.docking.org./

63. WiChempedia: ChemSpider Blog (2007)

http://www.chemspider.com/blog/wichempedia-is-now-on-its-way.html

64. Chemical Structure Lookup Service: National Institutes of Health,

http://cactus.nci.nih.gov/cgi-bin/lookup/search

65. CrystalEye Crystallographic Database:

http://wwmm.ch.cam.ac.uk/crystaleye/

66. Thirty Two Free Chemistry Databases: Apodaca R, Depth-First Blog,

http://depth-first.com/articles/2007/01/24/thirty-two-free-chemistry-databases

67. IBM’s Online Patent Search: (2008) IBM Chemical Search Alpha, IBM,

Almaden Services Research, San Jose, CA 95120, USA,

https://chemsearch.almaden.ibm.com/chemsearch/SearchServlet

68. Kemper K, Chemical Abstracts still developing ways to help its core –

scientists, Columbus Business First,

http://columbus.bizjournals.com/columbus/stories/2007/06/18/story20.html?page=1

69. Feigenbaum L, Herman I, Hongsermeier T, Neumann E, Stephens S: The

Semantic Web in Action, Scientific American Magazine

http://www.sciam.com/article.cfm?id=the-semantic-web-in-action

70. The Benefits of Crowdsourcing: http://en.wikipedia.org/wiki/Crowdsourcing

71. The Definition of a Blog: http://en.wikipedia.org/wiki/Blog

72. ScienceBlogs: http://scienceblogs.com/

73. Chemical BlogSpace: http://cb.openmolecules.net/

74. The Definition of a Wiki: http://en.wikipedia.org/wiki/Wiki

75. Wikipedia Chemical Drugbox: http://en.wikipedia.org/wiki/Template:Drugbox

76. Wikipedia Chemical Infobox:

http://en.wikipedia.org/wiki/Wikipedia:Chemical_infobox

77. Taxol on Wikipedia: http://en.wikipedia.org/wiki/Taxol

78. AP7 on Wikipedia: http://en.wikipedia.org/wiki/AP7

79. Bradley JC, Open Notebook Science Using Blogs and Wikis, Nature

Precedings (2007) doi:10.1038/npre.2007.39.1,

http://precedings.nature.com/documents/39/version/1

80. UsefulChem Open Notebook Science: Bradley JC, Drexel University,

http://usefulchem.wikispaces.com/All+Reactions and

http://usefulchem-experiments1.blogspot.com/2006/05/exp-009.html

81. Open Notebook Science: Neylon C, Science in the open, An openwetware blog

on the challenges of open and connected science (2008)

http://blog.openwetware.org/scienceintheopen/2007/12/12/a-big-few-weeks-for-open-notebook-science/

82. Open Notebook Science NMR: Murray-Rust P, A Scientist and the Web Blog

(2008) http://wwmm.ch.cam.ac.uk/blogs/murrayrust/?p=671

83. The IUPAC International Chemical Identifier: (2008)

http://www.iupac.org/inchi/

84. The IUPAC International Chemical Identifier Software: (2008)

http://www.iupac.org/inchi/release102.html

85. Royal Society of Chemistry: (2008) http://www.rsc.org/

86. Project Prospect: (2008) RSC Publishing,

http://www.rsc.org/Publishing/Journals/ProjectProspect/

87. Chemical Blogspace, (2008) http://cb.openmolecules.net/inchis.php

88. Knox C, Shrivastava S, Stothard P, Eisner R, Wishart DS: BioSpider, A web

Server for Automating Metabolome Annotations. Pacific Symposium on

Biocomputing, (2007) 12:145-156.

89. Cole SJ, Day NE, Murray-Rust P, Rzepa HS, Zhang Y: Enhancement of the

Chemical Semantic Web through INChIfication. Org Biomol Chem, (2005)

3:1832-1834

90. Willighagen EL, O'Boyle NM, Gopalakrishnan H, Jiao D, Guha R, Steinbeck C and

Wild DJ: Userscripts for the Life Sciences. BMC Bioinformatics, (2007) 8:487.

•• Discusses the use of userscripts to change the appearance of web pages by

modifying web content on the fly to enable aggregation of information and

computational results from different web resources into a single webpage. Indicative

of the future of integration and the possibilities which exist to gather information

from a multitude of resources and reformat and deliver to the consumer.

Figures

Figure 1: The Compound Summary Page for Taxol in PubChem. Page 1 only is

shown. (http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=36314)

Figure 2: The DrugBox for Taxol from Wikipedia (http://en.wikipedia.org/wiki/Taxol)

Figure 3: The TotallySynthetic.com blog. Paul Docherty discusses complex

syntheses and offers readers an opportunity to comment, analyze and provide

feedback. Many articles are labeled with InChIKeys to allow indexing by search

engines. (http://totallysynthetic.com/blog/)

Figure 4: An Example UsefulChem wiki page

(http://usefulchem.wikispaces.com/Exp148)

This UsefulChem wiki page shows a number of important content items: 1) Links to

the prior failed experiment; 2) Links to the docking results that justified making this

compound; 3) Full characterization (spectroscopy and photographs) of an isolated

product, with interactive NMRs (JSpecView/JCAMP-dx) of the starting materials; 4)

In the discussion section a question is posed by Professor Bradley to his student, and

then answered. The entire discussion history is captured. 5) A complete, detailed and

dated log of the steps taken by the student; 6) In the tag section, InChIs of every

compound used are provided for indexing by search engines.

[Structure diagram of L-ascorbic acid]

InChI=1/C6H8O6/c7-1-2(8)5-3(9)4(10)6(11)12-5/h2,5,7-10H,1H2/t2-,5+/m0/s1

CIWBSHSKHKDKBQ-JLAZNSOCBT

Figure 5: The InChI String (top) and InChI Key (bottom) for L-ascorbic acid.
