USUGM 2014 - Kevin Clark (Genentech): Searching Project Team Documents with Document to Database...

Date post: 11-Jul-2015
Upload: chemaxon
Transcript
Page 1: USUGM 2014 - Kevin Clark (Genentech): Searching Project Team Documents with Document to Database (D2DB)

Kevin P. Clark, Ph.D.

Chemaxon UGM, Cambridge, MA

September 2014

Searching Project Team Documents with D2DB

Page 2

Outline

• Use case: search and mine project team documents

• Text searching using Apache SOLR™

• Structure searching using D2DB

• Conclusion

Page 3

Use Case: Searching and mining project team documents

• Project teams generate numerous documents

• Project team reviews

• Target Candidate Profile

• Regular medicinal chemistry design session

• Presentations and reports

• HTS and Fragment screening

• In vivo reports (e.g., safety and PK/PD)

• Computational chemistry presentations and docking ideas

• Structural biology presentations

• Diagnostics and biomarker reports

• Publications

• Most documents are generated by the distributed project team

Page 4

Google Drive to manage project team documents

• Rationale

• Ease of use

• No workflow functionality required

• Access to our partners and CROs

• Versioning and simultaneous editing of Google native documents

• Much of the administration in the hands of the project teams

• Organization and structure

• Access permissions

• Shortcomings

• No wildcard or substring search

• Users have a difficult time finding documents

• Today over 84K project team documents

Page 5

SOLR provides text searching of project team documents

• Open source enterprise search platform from Apache Lucene™

• Full-text search

• Faceted search

• Hit highlighting

• Wildcard searches

• Regular expression searches

• Proximity searches

• Fuzzy searches

• Documents from Google Drive are copied to a file system inside our firewall every 30 minutes

• Security details

• SOLR servers restrict access by Internet Protocol (IP) address

• LucidWorks implemented the LDAP integration for authorization
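
The search types listed on this slide correspond to standard Solr query syntax and request parameters. The following is a minimal sketch only: the endpoint URL and the field names `content` and `project` are assumptions made for illustration and do not come from the slides.

```python
from urllib.parse import urlencode

# Hypothetical endpoint and field names, for illustration only.
SOLR_SELECT = "http://localhost:8983/solr/project_docs/select"

# One example per search type, written in Lucene/Solr query syntax.
EXAMPLE_QUERIES = {
    "wildcard": "content:G*1234",
    "regex": "content:/G[0-9]+1234/",
    "proximity": 'content:"time inhibition"~3',
    "fuzzy": "content:inhibition~1",
}


def build_search_url(query, facet_field="project", highlight_field="content"):
    """Build a select URL with faceting and hit highlighting switched on."""
    params = {
        "q": query,
        "facet": "true",
        "facet.field": facet_field,
        "hl": "true",
        "hl.fl": highlight_field,
        "wt": "json",
    }
    return SOLR_SELECT + "?" + urlencode(params)


url = build_search_url(EXAMPLE_QUERIES["wildcard"])
print(url)
```

Faceting on a project field is one plausible way to narrow a hit list; the facets actually used in the application are not stated on the slide.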

Page 6

Text search input page

Page 7

SOLR text search result page

Page 8

Introducing structure search with D2DB

• SOLR search application allows users to:

• Find documents by full text search

• Facet the results for narrowing the hit list

• Wildcard and regular expression search

• Proximity and fuzzy searches

• With text based search

• Partial corporate identifiers (G*1234)

• Partial corporate identifiers for a project (“Project A” AND G*1234)

• HTS hit follow up for “Project A”

• How did other teams handle time dependent inhibition (TDI)?

• Structure searching examples

• Finding documents by structure without corporate identifier

• Find all documents with a particular substructure
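
The partial-identifier examples above can be illustrated locally. This sketch composes the query strings in the form shown on the slide and emulates the `*` wildcard with Python's `fnmatch`; Solr itself matches against indexed terms, so this is only an approximation of the behavior. The identifiers in `candidates` are made up.

```python
from fnmatch import fnmatchcase


def text_query(identifier_pattern, project=None):
    """Compose a query string in the style shown on the slide."""
    if project:
        return f'"{project}" AND {identifier_pattern}'
    return identifier_pattern


def matching_identifiers(pattern, identifiers):
    """Emulate the wildcard locally: '*' matches any run of characters."""
    return [i for i in identifiers if fnmatchcase(i, pattern)]


# Made-up identifiers, for illustration only.
candidates = ["G00001234", "G00991234", "G00001235", "X00001234"]

print(text_query("G*1234", project="Project A"))
print(matching_identifiers("G*1234", candidates))
```

A text query of this kind only finds documents that mention the identifier, which is why the structure searching examples on the slide call for a different approach.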

Page 9

Extracting chemical information from documents with D2DB

• ChemAxon’s Naming Technology

• IUPAC names

• Common names

• Drug trade names

• SMILES

• InChI

• CAS registry numbers

• Embedded structures

• ChemDraw

• Symyx Draw

• MarvinSketch

• Optical structure recognition

• OSRA

• CLiDE

• Imago
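
Of the textual sources listed on this slide, CAS registry numbers are the simplest to recognize because they carry a check digit. The sketch below shows that general check-digit rule; it is not a description of how ChemAxon's naming technology works internally, and the function names are hypothetical.

```python
import re

# CAS Registry Number layout: 2 to 7 digits, 2 digits, 1 check digit.
CAS_PATTERN = re.compile(r"\b(\d{2,7})-(\d{2})-(\d)\b")


def is_valid_cas(candidate):
    """Return True if the candidate has the CAS layout and a correct check digit."""
    match = CAS_PATTERN.fullmatch(candidate)
    if match is None:
        return False
    digits = match.group(1) + match.group(2)
    # Weight the digits 1, 2, 3, ... starting from the rightmost one.
    total = sum(w * int(d) for w, d in enumerate(reversed(digits), start=1))
    return total % 10 == int(match.group(3))


def find_cas_numbers(text):
    """Return every CAS-shaped token in the text that passes the check digit test."""
    return [m.group(0) for m in CAS_PATTERN.finditer(text) if is_valid_cas(m.group(0))]


sample = "Water (7732-18-5), oseltamivir (196618-13-0), and a bad token 1234-56-7."
print(find_cas_numbers(sample))
```

The check digit filters out number patterns that merely look like registry numbers, which helps when scanning free text.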

Page 10

Configuring and running D2DB

• Edit the configuration properties file

• Specify the database parameters

• db.type = oracle

• db.host = orcl.gene.com

• db.port = 1521

• db.name = orcl.gene.com

• db.username = scott

• db.password = tiger

• Specify other options such as

• d2s.options = -osra

• d2db.threads = 16

• d2db.structure_table.format = mol

• Run d2db from the command line

• ./d2db d2db.conf create

• ./d2db d2db.conf index <path>
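
The settings and commands on this slide can be gathered into one place. The sketch below reads the properties as listed and assembles the two command lines without executing them. The parser is a simplified stand-in for a real properties reader, and `/path/to/documents` is a placeholder for the `<path>` argument, whose real value the slide does not give.

```python
import shlex

# Property keys and values exactly as listed on the slide.
CONF_TEXT = """\
db.type = oracle
db.host = orcl.gene.com
db.port = 1521
db.name = orcl.gene.com
db.username = scott
db.password = tiger
d2s.options = -osra
d2db.threads = 16
d2db.structure_table.format = mol
"""


def parse_properties(text):
    """Parse simple 'key = value' lines, skipping blanks and '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def d2db_commands(conf_path, documents_path):
    """Return the two invocations from the slide as argument lists."""
    return [
        ["./d2db", conf_path, "create"],
        ["./d2db", conf_path, "index", documents_path],
    ]


props = parse_properties(CONF_TEXT)
commands = d2db_commands("d2db.conf", "/path/to/documents")
for command in commands:
    print(shlex.join(command))
```

Keeping the commands as argument lists makes them safe to hand to `subprocess.run` later, should the run be scripted.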

Page 11

Integration of D2DB using ChemAxon’s technology

• Marvin for JS

• Google dropping support for NPAPI in Chrome

• Replacing ChemDraw plug-in with Marvin for JS

• ChemDraw 14 supports copy/paste molfile

• JChem cartridge for structure searching

• Successfully migrated from Accord to JChem cartridge

• Performance improvement

• D2DB

• Naming technology

• Embedded chemistry

• Optical structure recognition

Page 12

Embedded ChemDraw structure for Oseltamivir (Tamiflu)

Page 13

Exact structure search for Oseltamivir (Tamiflu)

Page 14

Results page for the structure search

Page 15

Extending D2DB to recognize corporate identifiers

• Many of our documents contain references to Gnumbers, our corporate identifiers

• After work with Daniel Bonniot, D2DB now adds structures for corporate identifiers

• Configuration file changes
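
The idea of resolving corporate identifiers to structures can be sketched conceptually. The slides do not show the actual configuration changes or the identifier format, so everything here is assumed: the "G followed by eight digits" pattern, the `REGISTRY` lookup, and its placeholder identifiers and SMILES strings are all hypothetical.

```python
import re

# Assumed identifier shape: "G" followed by eight digits.
# The real format is not given in the slides.
GNUMBER_PATTERN = re.compile(r"\bG\d{8}\b")

# Hypothetical registry lookup; identifiers and SMILES are placeholders.
REGISTRY = {
    "G00001234": "CCO",
    "G00005678": "c1ccccc1",
}


def structures_for_identifiers(text, registry):
    """Map each identifier found in the text to its registered structure, if any."""
    found = {}
    for identifier in GNUMBER_PATTERN.findall(text):
        if identifier in registry:
            found[identifier] = registry[identifier]
    return found


text = "Follow-up on G00001234 and G00009999 for Project A."
print(structures_for_identifiers(text, REGISTRY))
```

Identifiers that appear in a document but are missing from the registry are skipped rather than treated as errors in this sketch.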

Page 16

Conclusion

• Indexing results

• 79K out of 84K documents have been indexed

• 5K documents failed due to a few common exceptions

• Extracted over 150K structures with 30K from Gnumbers

• D2DB conclusion

• Easy to configure and run

• ChemAxon (Daniel Bonniot) was very responsive to requests and questions

• Enabled structure searching of project team documents

• Future directions

• Collaborate with ChemAxon to resolve exceptions

• Evaluate CLiDE

• Combine text and structure searching across project team documents
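
The indexing figures above can be put in proportion with a short calculation. The inputs are the rounded counts from the slide ("over 150K" is treated as 150K), so the percentages are approximate.

```python
# Rounded counts as reported on the slide.
total_documents = 84_000
indexed_documents = 79_000
failed_documents = 5_000
structures = 150_000
structures_from_gnumbers = 30_000


def percent(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


indexed_pct = percent(indexed_documents, total_documents)
failed_pct = percent(failed_documents, total_documents)
gnumber_pct = percent(structures_from_gnumbers, structures)

print(f"Indexed: {indexed_pct}%")
print(f"Failed: {failed_pct}%")
print(f"Structures from Gnumbers: {gnumber_pct}%")
```

On these figures roughly 94% of documents were indexed, about 6% failed, and about a fifth of the extracted structures came from Gnumbers.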
