Serving the medicinal chemistry community with Royal Society of Chemistry cheminformatics platforms

Post on 20-Jun-2015

description

The Royal Society of Chemistry (RSC) is a major provider of web access to chemistry-related data. As an internationally renowned society for the chemical sciences, a scientific publisher and the host of the ChemSpider database for the community, RSC continues to expand online access to data. ChemSpider provides access to over 30 million chemicals sourced from over 500 data suppliers and linked out to related information on the web. The platform is a crowdsourcing environment in which members of the community can help validate and expand the content of the database. Through a set of application programming interfaces, ChemSpider is used by various organizations and projects to serve data for a range of purposes. These include structure identification for mass spectrometry instrument vendors, RSC databases such as the MarinLit natural products database, and a European grant-based project funded by the Innovative Medicines Initiative. This presentation will provide an overview of the cheminformatics activities and projects through which RSC serves the medicinal chemistry community. These include the Open PHACTS semantic web project, the PharmaSea project to identify new pharmaceutical leads from the ocean, and the UK National Compound Collection to identify new lead compounds contained within PhD theses.

transcript

Serving the Medicinal Chemistry Community with RSC Cheminformatics Platforms

Antony Williams

Brazilian Medicinal Chemistry Conference, November 11th 2014

www.slideshare.net/AntonyWilliams

Chemistry for the Community

• The Royal Society of Chemistry as a provider of chemistry for the community:
• As a charity
• As a scientific publisher
• As a host of commercial databases
• As a partner in grant-based projects
• As the host of ChemSpider
• New: the RSC Data Repository for Chemistry

Overwhelmed with data…

Organizations releasing data

Funders encourage openness

We model data – then lose it

• What if we could share models and the underlying data via a central repository?

• This is MOSTLY not a technology issue!!!

Pharma Companies Repeat Work

Pre-competitive Informatics: Pharma are all accessing, processing, storing & re-processing external research data

Literature, PubChem, GenBank, Patents, Databases, Downloads

Data Integration, Data Analysis, Firewalled Databases

Repeat at each company

Publications lock up data

When I finish this article…

The data will be locked up..

But what if we could navigate?

What’s the structure?

Are they in our file?

What’s similar?

What’s the target?

Pharmacology data?

Known Pathways?

Working On Now?

Connections to disease?

Expressed in right cell type?

Competitors?

IP?

• ~30 million chemicals and growing
• Data sourced from >500 different sources
• Crowdsourced curation and annotation
• Ongoing deposition of data from our journals and our collaborators
• Structure-centric hub for web-searching
• It’s a really big dictionary!!!

ChemSpider

Experimental/Predicted Properties

Literature references

Google Books

Patents references

RSC Databases

Vendors and data sources

With structures and names…

Name Searching

Standards for Integration

Structure Searching

What did we learn???

• Data Quality is an enormous challenge
• Crowdsourced annotation can help!

Crowdsourced Enhancement

• The community can clean and enhance the database by providing feedback and direct curation

• Tens of thousands of edits made

But Software Can Help

• SRS as guidance for standardization rules

http://cvsp.chemspider.com
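As a rough illustration of what rule-based validation can look like, the sketch below runs two simple checks over a compound record. The rules, the record layout and the function names are assumptions made for this example only; they are not the rules applied by the validation platform named on the slide, which relies on full structure perception rather than string checks.

```python
import re

# Standard InChIKey layout: 14 letters, hyphen, 10 letters, hyphen, 1 letter.
INCHIKEY_PATTERN = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")


def check_balanced(smiles):
    """Return True if every '(' and '[' in the SMILES string is closed in order."""
    closers = {")": "(", "]": "["}
    stack = []
    for ch in smiles:
        if ch in "([":
            stack.append(ch)
        elif ch in closers:
            if not stack or stack.pop() != closers[ch]:
                return False
    return not stack


def validate_record(record):
    """Return a list of human-readable issues found in a (hypothetical) record dict."""
    issues = []
    smiles = record.get("smiles", "")
    if not smiles:
        issues.append("missing SMILES")
    elif not check_balanced(smiles):
        issues.append("unbalanced brackets in SMILES")
    inchikey = record.get("inchikey", "")
    if inchikey and not INCHIKEY_PATTERN.match(inchikey):
        issues.append("malformed InChIKey")
    return issues


good = validate_record({"smiles": "c1ccccc1", "inchikey": "UHOVQNZJYSORNB-UHFFFAOYSA-N"})
bad = validate_record({"smiles": "CC(=O", "inchikey": "not-a-key"})
print(good, bad)
```

Each rule returns a message instead of raising, so one pass over a deposited record can report every problem at once, which suits the curation feedback described earlier.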

ChemSpider is a building block

…for grant-based services

• Use ChemSpider data slices and API/Web services to support grant-based projects

• Multiple European consortium-based grants
• PharmaSea (FP7 funded)
• Open PHACTS (IMI funded)

• Support Open Drug Discovery projects

Over half of all drugs introduced between 1940 and 2006 were of natural origin or inspired by natural compounds

http://www.pharma-sea.eu/

PharmaSea

…as a dereplication platform

http://www.openphacts.org/

• 3-year Innovative Medicines Initiative project

• Integrating chemistry and biology data using semantic web technologies

• Open code, open data, open standards

• Academics, Pharma Companies, Publishers…

The Open PHACTS community ecosystem

Open PHACTS http://ops.rsc.org

Chemistry Searching…

But what about Biology???

http://explorer.openphacts.org

Open PHACTS Explorer

Pharmacology by Compound

Compounds and enzymes

Pharmacology by Target

Facilitated by ChemSpider RDF
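The "pharmacology by compound" and "pharmacology by target" views can be pictured as two queries over the same set of subject-predicate-object statements. The sketch below is a toy in-memory version of that idea, not the Open PHACTS API or the ChemSpider RDF schema: every identifier and predicate name in it is a hypothetical placeholder.

```python
# Hypothetical placeholder statements; not real Open PHACTS or ChemSpider URIs.
TRIPLES = [
    ("compound:A", "hasInChIKey", "KEY-A"),
    ("compound:A", "activeAgainst", "target:X"),
    ("compound:B", "activeAgainst", "target:X"),
    ("compound:B", "activeAgainst", "target:Y"),
    ("target:X", "involvedIn", "pathway:P"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    ]


def targets_of(compound):
    """Pharmacology by compound: every target the compound is linked to."""
    hits = match(TRIPLES, subject=compound, predicate="activeAgainst")
    return sorted(t[2] for t in hits)


def compounds_for(target):
    """Pharmacology by target: every compound linked to the target."""
    hits = match(TRIPLES, predicate="activeAgainst", obj=target)
    return sorted(t[0] for t in hits)


print(targets_of("compound:B"))
print(compounds_for("target:X"))
```

Both views read the same statements from opposite ends, which is why a shared structure-centric identifier for each compound matters so much for integration.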

Open Sourcing Data and Code

• Open PHACTS data licensed as Open Data – approx. 2 Million chemicals

• Open Source code to release to community (from Open PHACTS GitHub site)
• Chemical Registration Service
• Chemical Validation and Standardization Platform

ChemSpider as a “dictionary”

• Systematic name(s)
• Trivial Name(s)
• SMILES
• InChI Strings
• InChIKeys
• Database IDs
• Registry Number
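The "dictionary" idea in the list above can be sketched as a lookup table in which every label for a compound resolves to the same record. This is a minimal illustration, not ChemSpider's data model: the class names and the record ID are hypothetical. Names are matched case-insensitively, while structure identifiers stay case-sensitive because case carries meaning in SMILES.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class CompoundRecord:
    """One dictionary entry: a structure plus the labels that point to it."""
    record_id: int  # hypothetical local ID, not a real ChemSpider ID
    smiles: str
    inchikey: str = ""
    names: List[str] = field(default_factory=list)


class ChemicalDictionary:
    """Resolves a name, SMILES string or InChIKey to a single record."""

    def __init__(self) -> None:
        self._by_name: Dict[str, CompoundRecord] = {}
        self._by_identifier: Dict[str, CompoundRecord] = {}

    def add(self, record: CompoundRecord) -> None:
        for name in record.names:
            self._by_name[name.strip().lower()] = record
        for identifier in (record.smiles, record.inchikey):
            if identifier:
                self._by_identifier[identifier] = record

    def lookup(self, text: str) -> Optional[CompoundRecord]:
        text = text.strip()
        # Try exact identifiers first, then fall back to case-insensitive names.
        return self._by_identifier.get(text) or self._by_name.get(text.lower())


diazepam = CompoundRecord(
    record_id=1,
    smiles="CN1C(=O)CN=C(c2ccccc2)c2cc(Cl)ccc21",
    inchikey="AAOVKJBEBIDNHE-UHFFFAOYSA-N",
    names=["Diazepam", "Valium"],
)
dictionary = ChemicalDictionary()
dictionary.add(diazepam)
print(dictionary.lookup("valium").inchikey)
```

Because a trade name, a systematic name and an InChIKey all land on one record, a name found in an article can be linked to a structure and from there to everything else attached to it.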

Valium on ChemSpider

With strong dictionaries connections can be made…

Semantic Mark-up of Articles

Linking Names to Structures

MedChemComm

…and providing more links…

Mark-up of MedChemComm

Links out from MedChemComm

What about old publications?

• We would LOVE to bring data out of our archive
• Find chemical names and generate structures
• Find chemical images and generate structures
• Find reactions – and make a database!
• Find data (MP, BP, LogP) and host. Build models!
• Find figures and database them
• Find spectra (and link to structures)
• Validate the data algorithmically

RSC Archive – since 1841

SO MANY reactions!

Text Mining

The N-(β-hydroxyethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea prepared in Example 6 , thionyl chloride ( 5 ml ) and benzene ( 50 ml ) were charged into a glass reaction vessel equipped with a mechanical stirrer , thermometer and reflux condenser .

The reaction mixture was heated at reflux with stirring , for a period of about one-half hour .

After this time the benzene and unreacted thionyl chloride were stripped from the reaction mixture under reduced pressure to yield the desired product N-(β-chloroethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea as a solid residue .

But names = structures

• Systematic names can be generated FROM chemical structures algorithmically

…and Context Gives Reactions

The N-(β-hydroxyethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea prepared in Example 6 , thionyl chloride ( 5 ml ) and benzene ( 50 ml ) were charged into a glass reaction vessel equipped with a mechanical stirrer , thermometer and reflux condenser .

The reaction mixture was heated at reflux with stirring , for a period of about one-half hour .

After this time the benzene and unreacted thionyl chloride were stripped from the reaction mixture under reduced pressure to yield the desired product N-(β-chloroethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea as a solid residue .
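A minimal sketch of how context can turn such a paragraph into reaction data, using only regular expressions on an abridged copy of the passage above. The phrase patterns and function names are assumptions chosen to fit this one example; a production text-mining pipeline would rely on chemical entity recognition and grammar-based parsing rather than fixed phrases.

```python
import re

# Abridged copy of the tokenised passage (punctuation is set off by spaces).
TEXT = (
    "The N-(β-hydroxyethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea "
    "prepared in Example 6 , thionyl chloride ( 5 ml ) and benzene ( 50 ml ) were "
    "charged into a glass reaction vessel . "
    "After this time the benzene and unreacted thionyl chloride were stripped from the "
    "reaction mixture under reduced pressure to yield the desired product "
    "N-(β-chloroethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea "
    "as a solid residue"
)

# "<name> ( <amount> <unit> )" at the start of a list item.
AMOUNT = re.compile(
    r"(?P<name>.+?)\s*\(\s*(?P<amount>\d+(?:\.\d+)?)\s*(?P<unit>ml|mg|g)\s*\)"
)


def extract_reagents(text):
    """Return (name, amount, unit) for each list item that carries a quantity."""
    reagents = []
    # List items are separated by " , " or " and " in the tokenised text.
    for segment in re.split(r"\s,\s|\sand\s", text):
        found = AMOUNT.match(segment.strip())
        if found:
            reagents.append(
                (found.group("name"), float(found.group("amount")), found.group("unit"))
            )
    return reagents


def extract_product(text):
    """Return the name that follows the phrase 'to yield the desired product'."""
    found = re.search(r"to yield the desired product (\S+)", text)
    return found.group(1) if found else None


print(extract_reagents(TEXT))
print(extract_product(TEXT))
```

Once the names are converted to structures, the starting material, the reagents with their quantities and the product together make up one record for a reaction database.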

ChemSpider Reactions

Text spectra?

• 13C NMR (CDCl3, 100 MHz): δ = 14.12 (CH3), 30.11 (CH, benzylic methane), 30.77 (CH, benzylic methane), 66.12 (CH2), 68.49 (CH2), 117.72, 118.19, 120.29, 122.67, 123.37, 125.69, 125.84, 129.03, 130.00, 130.53 (ArCH), 99.42, 123.60, 134.69, 139.23, 147.21, 147.61, 149.41, 152.62, 154.88 (ArC)

1H NMR (CDCl3, 400 MHz): δ = 2.57 (m, 4H, Me, C(5a)H), 4.24 (d, 1H, J = 4.8 Hz, C(11b)H), 4.35 (t, 1H, Jb = 10.8 Hz, C(6)H), 4.47 (m, 2H, C(5)H), 4.57 (dd, 1H, J = 2.8 Hz, C(6)H), 6.95 (d, 1H, J = 8.4 Hz, ArH), 7.18–7.94 (m, 11H, ArH)
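Text spectra like these can be converted back into numbers with a small parser. The sketch below handles the 13C line above under simplifying assumptions: the header has the form "nucleus NMR (solvent, frequency MHz)", shifts follow "δ =", and assignments sit in un-nested parentheses. The function name and output layout are illustrative only, and the 1H line would need a fuller parser because its assignments contain nested parentheses and coupling constants.

```python
import re

LINE_13C = (
    "13C NMR (CDCl3, 100 MHz): δ = 14.12 (CH3), 30.11 (CH, benzylic methane), "
    "30.77 (CH, benzylic methane), 66.12 (CH2), 68.49 (CH2), 117.72, 118.19, "
    "120.29, 122.67, 123.37, 125.69, 125.84, 129.03, 130.00, 130.53 (ArCH), "
    "99.42, 123.60, 134.69, 139.23, 147.21, 147.61, 149.41, 152.62, 154.88 (ArC)"
)

HEADER = re.compile(
    r"(?P<nucleus>\d+[A-Z][a-z]?) NMR \((?P<solvent>[^,]+), (?P<mhz>\d+) MHz\)"
)


def parse_nmr_text(line):
    """Split a text spectrum into acquisition metadata and a list of shifts (ppm)."""
    header, _, body = line.partition("δ =")
    found = HEADER.search(header)
    if found is None:
        raise ValueError("no NMR header found")
    # Drop parenthesised assignments such as "(CH3)" so that only shifts remain.
    body = re.sub(r"\([^)]*\)", "", body)
    shifts = [float(value) for value in re.findall(r"\d+\.\d+", body)]
    return {
        "nucleus": found.group("nucleus"),
        "solvent": found.group("solvent"),
        "frequency_mhz": int(found.group("mhz")),
        "shifts_ppm": shifts,
    }


spectrum = parse_nmr_text(LINE_13C)
print(spectrum["nucleus"], spectrum["solvent"], len(spectrum["shifts_ppm"]))
```

With the shifts as numbers, the peak list can be plotted as a stick spectrum, compared with predicted shifts for the claimed structure, or checked against the expected carbon count.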

Turn “Figures” Into Data

FIGURE

EXTRACTED

Models published from data

Text-mining Data to compare

Presently extracting property data from Google Patents as a test

National Compound Collection

• Unlock chemistry data in PhD theses
• Discover novel molecules for biosciences
• Working together with industry and the academic community
• Build the data into RSC online platforms
• Perform virtual screening/modeling and access physical samples to screen

We should make sure thesis data are available in consumable formats – compounds, reactions, analytical data etc.

What are we building?

• We are building the “RSC Data Repository”
• Containers for compounds, reactions, analytical data, tabular data
• Algorithms for data validation and standardization
• Flexible indexing and search technologies
• A platform for modeling data and hosting existing models and predictive algorithms
• Chemistry RESEARCH DATA MANAGEMENT

Compounds

Reactions

Analytical data

Crystallography data

What’s the structure?

Are they in our file?

What’s similar?

What’s the target?

Pharmacology data?

Known Pathways?

Working On Now?

Connections to disease?

Expressed in right cell type?

Competitors?

IP?

With data in hand maybe it’s time

Conclusions

• We are building platforms that can support multiple communities, including MedChem

• We are working hard to extend our data and improve quality of online data

• Opening APIs to the platforms and data provides a powerful building block

• We are Open Sourcing components of our platforms to the community

Thank you

Email: williamsa@rsc.org
ORCID: 0000-0002-2668-4821
Twitter: @ChemConnector
Personal Blog: www.chemconnector.com
SLIDES: www.slideshare.net/AntonyWilliams