MadFast Similarity Search

Gábor Imre

Agenda

• Short introduction to MadFast

• Demo:

– Getting started

– Using Web UI / REST API

– Using the command line

– Searching large datasets

• Roadmap

• Q&A

Introduction: a short one

How fast?

• Very.

• Some numbers (EC2 c3.8xlarge / r3.8xlarge machine):

– Search for a few tens of most similar structures for a single query: 200 M targets / s

– Prepare a single fingerprint: 1 M targets / min

– Read prepared data into memory: 1 M targets / s

– Using ~250 MB memory per M targets (1024 bit fp)

• What does it mean?

– Real-time similarity search of tens of millions of structures, even on a desktop

– Or handle even 1B structures on an r3.8xlarge instance

– Or provide near-real-time search of 1B structures (~5 s / query)

– Or do an exhaustive 1M x 1M similarity search in 30 minutes
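
The figures above can be cross-checked with simple arithmetic. The sketch below is only a back-of-envelope estimate built from the quoted rates; it assumes decimal units (1 M = 10^6, 1 MB = 10^6 bytes) and that cost scales linearly with the number of targets. The function and constant names are illustrative, not part of MadFast.

```python
# Rates quoted on the slide (EC2 c3.8xlarge / r3.8xlarge).
SEARCH_RATE = 200e6         # targets compared per second for a single query
MEMORY_PER_TARGET = 250.0   # bytes per target (~250 MB per million, 1024-bit fp)
LOAD_RATE = 1e6             # prepared targets read into memory per second


def estimate(targets):
    """Rough resource estimates for a dataset of `targets` structures."""
    return {
        "memory_gb": targets * MEMORY_PER_TARGET / 1e9,
        "query_seconds": targets / SEARCH_RATE,
        "load_minutes": targets / LOAD_RATE / 60,
    }


billion = estimate(1e9)
print(billion)

# Pair rate implied by "exhaustive 1M x 1M similarity search in 30 minutes".
implied_pair_rate = (1e6 * 1e6) / (30 * 60)
print(f"{implied_pair_rate / 1e6:.0f} M pairs / s")
```

For 1B targets this gives about 250 GB of memory and 5 s per query, in line with the "~5 s / query" claim. The 1M x 1M figure implies a pair rate above the single-query rate, which may mean that many-to-many searches are processed more efficiently; that is an inference from the numbers, not something the slides state.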

What is MadFast Similarity Search?

• A young product, released last December

• Engine for fast similarity searching with efficient in-memory storage

– Binary and float vector descriptors with various metrics

• Fast descriptor (fingerprint) calculation is also provided

– CFP, ECFP and MACCS-166 included; externally calculated fingerprints can also be used

• Collection of stand-alone tools implemented in Java

• Providing CLI, Web UI and REST API interfaces
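
To make "similarity searching over binary descriptors" concrete, here is a minimal brute-force sketch. It is not MadFast's implementation: it assumes Tanimoto dissimilarity (a common metric for binary fingerprints), stores each toy fingerprint as a Python integer, and uses hypothetical function names.

```python
import heapq


def tanimoto_dissimilarity(a, b):
    """1 - |a AND b| / |a OR b| for two fingerprints stored as Python ints."""
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0  # two empty fingerprints are treated as identical
    common = bin(a & b).count("1")
    return 1.0 - common / union


def most_similar(query, targets, count):
    """Return (target index, dissimilarity) for the `count` closest targets."""
    scored = ((i, tanimoto_dissimilarity(query, t)) for i, t in enumerate(targets))
    return heapq.nsmallest(count, scored, key=lambda hit: hit[1])


# Four 8-bit toy fingerprints; real fingerprints are e.g. 1024 bits long.
targets = [0b10110010, 0b10110011, 0b00001111, 0b11110000]
hits = most_similar(0b10110010, targets, count=2)
print(hits)
```

Every target is compared against the query and only the best `count` hits are kept, so the work grows linearly with the number of targets. This is the kind of scan that the "targets / s" rates describe.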

Speed of typical tasks

Getting started is simple

On the product page

At https://www.chemaxon.com/products/madfast

Follow “Download MadFast” link

To the download page

Where a .tar distribution is available.

You will need Linux or Windows + Cygwin to use.

You will need

• Oracle Java 1.8 installed

• Cygwin installed (on Windows)

• ChemAxon license file

- free evaluation: [email protected]

- copy to ~/.chemaxon/license.cxl or c:\Users\<username>\chemaxon\license.cxl

And the .tar file unpacked. From now on we go with Windows + Cygwin.

The Web UI and the REST API

No further installation required.

• Run a command line to see if Java runs correctly: bin/searchStorage.sh -h

• Launch a self contained example: examples/rest-api-example.sh

and connect to the launched embedded server at http://localhost:8085/

• We will explore the Web UI using a more meaningful dataset. Launch examples/rest-api-small.sh

then connect to http://localhost:8085/

Under the hood

Focused chemical space exploration

Or query through the REST API

curl \
    -X POST \
    -d "count=4" \
    --data-urlencode "query=C([C@@H]([C@@H]1C(=C(C(=O)O1)O)O)O)O" \
    -g \
    "http://localhost:8085/rest/descriptors/nci-250k-cfp7/find-most-similars" | python -m json.tool

{
    "query": "C([C@@H]([C@@H]1C(=C(C(=O)O1)O)O)O)O",
    "querysmi": "OC[C@H](O)[C@H]1OC(=O)C(O)=C1O",
    "searchtime": 16,
    "targetcount": 249081,
    "targets": [
        {
            "base64img": null,
            "dissimilarity": 0.0,
            "targetid": "NCI8117",
            "targetimageurl": "rest/molecules/nci-250k/7975/png-or-placeholder?w=100&h=100",
            "targetindex": 7975,
            "targetmolurl": "rest/molecules/nci-250k/7975"
        },
        …
    ]
}
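
The same query can be issued from a script. The following is a minimal sketch using only the Python standard library. It mirrors the curl call above (form-encoded `count` and `query` fields, POST to `find-most-similars`) and reads only the `targetid` and `dissimilarity` fields seen in the response. The helper names `build_request` and `summarize` are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def build_request(base_url, descriptor, smiles, count):
    """Build the same form-encoded POST that the curl command sends."""
    url = f"{base_url}/rest/descriptors/{descriptor}/find-most-similars"
    body = urllib.parse.urlencode({"count": count, "query": smiles}).encode("ascii")
    return urllib.request.Request(url, data=body, method="POST")


def summarize(response_text):
    """Extract (targetid, dissimilarity) pairs from a response body."""
    doc = json.loads(response_text)
    return [(t["targetid"], t["dissimilarity"]) for t in doc["targets"]]


request = build_request(
    "http://localhost:8085",
    "nci-250k-cfp7",
    "C([C@@H]([C@@H]1C(=C(C(=O)O1)O)O)O)O",
    4,
)

# Sending the request needs the example server to be running:
# with urllib.request.urlopen(request) as resp:
#     print(summarize(resp.read().decode("utf-8")))

# Offline check against a trimmed copy of the response shown above.
sample = '{"targets": [{"targetid": "NCI8117", "dissimilarity": 0.0}]}'
print(summarize(sample))
```

`urlencode` percent-encodes the SMILES string, which is what `--data-urlencode` does in the curl version.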

The command line

Multi query search

gzip -dc data/molecules/drugbank/drugbank-common_name.smi.gz | \
    bin/searchStorage.sh \
    -tmf - \
    -qmf data/molecules/vitamins/vitamins.smi \
    -context createSimpleCfp7Context

Query  Target  Dissimilarity
0      54      0.0
1      409     0.14814814814814814
2      6031    0.02631578947368421
3      44      0.0
4      32      0.0
5      513     0.0
...
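
Output like the table above is easy to post-process. The sketch below assumes the columns are whitespace-separated and that query and target values are integer indices, as in the excerpt shown; `parse_hits` is a hypothetical helper, not part of the distribution.

```python
from collections import defaultdict


def parse_hits(text):
    """Parse 'Query Target Dissimilarity' rows into {query: [(target, d), ...]}."""
    hits = defaultdict(list)
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3 or fields[0] == "Query":
            continue  # skip the header, blank lines and the trailing "..."
        query, target, dissimilarity = int(fields[0]), int(fields[1]), float(fields[2])
        hits[query].append((target, dissimilarity))
    return dict(hits)


output = """Query Target Dissimilarity
0 54 0.0
1 409 0.14814814814814814
2 6031 0.02631578947368421
...
"""
hits = parse_hits(output)

# Queries that have an exact match (dissimilarity 0.0) among the targets.
exact = [q for q, rows in hits.items() if any(d == 0.0 for _, d in rows)]
print(exact)
```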

More hits, IDs, formatting, out file

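gzip -dc data/molecules/drugbank/drugbank-common_name.smi.gz | \
    bin/searchStorage.sh \
    -tmf - \
    -qmf data/molecules/vitamins/vitamins.smi \
    -context createSimpleCfp7Context \
    -mode MOSTSIMILARS -count 3 \
    -qidname -tidname -out-numeric-format "%.3f" -out res.txt

cat res.txt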

Query                      Target    Dissimilarity
Vitamin A - Retinol        Vitamin A    0.000
Vitamin A - Retinol        Alitretinoin    0.148
Vitamin A - Retinol        Tretinoin    0.148
Vitamin A - Retinal        Alitretinoin    0.148
Vitamin A - Retinal        Tretinoin    0.148
Vitamin A - Retinal        Isotretinoin    0.148
Vitamin A - beta-Carotene  1,3,3-trimethyl-2-[(1E,3E)-3-methylpenta-1,3-dien-1-yl]cyclohexene    0.026
Vitamin A - beta-Carotene  Vitamin A    0.174
Vitamin A - beta-Carotene  (6e)-6-[(2e,4e,6e)-3,7-Dimethylnona-2,4,6,8-Tetraenylidene]-1,5,5-Trimethylcyclohexene    0.200
Vitamin B1 - Thiamine      Thiamine    0.000
Vitamin B1 - Thiamine      Thiamin Phosphate    0.158

Heatmap

…
    -qidname \
    -tmf data/molecules/vitamins/vitamins.smi \
    -tidname \
    -mode FULLMATRIX \
    -out vitamins-fullmatrix.txt \
    -heatmap-image vitamins-fullmatrix.png \
    -heatmap-image-cellsize 15 \
    -heatmap-image-query-ids-length 250 \
    -heatmap-image-target-ids-length 250

Further inputs for search

Search ~1B structures

A server with GDB-13 launched

• An r3.8xlarge instance on Amazon EC2 is running during this webinar: 32 vCPUs, 244 GiB RAM, currently $2.66 / hour

• Near-real-time search: http://ec2-54-74-38-126.eu-west-1.compute.amazonaws.com:8081/simsearch.html?ref=rest/descriptors/gdb-13-cfp7&dist=hide

• Plus SureChEMBL: http://ec2-54-74-38-126.eu-west-1.compute.amazonaws.com:8081/simsearch.html?ref=rest/descriptors/surechembl-cfp7/

REST API is also available

curl \
    -X POST \
    -d "count=100" \
    --data-urlencode "query=C([C@@H]([C@@H]1C(=C(C(=O)O1)O)O)O)O" \
    -g \
    http://ec2-54-74-38-126.eu-west-1.compute.amazonaws.com:8081/rest/descriptors/gdb-13-cfp7/find-most-similars | python -m json.tool

{
    "query": "C([C@@H]([C@@H]1C(=C(C(=O)O1)O)O)O)O",
    "querysmi": "OC[C@H](O)[C@H]1OC(=O)C(O)=C1O",
    "searchtime": 3917,
    "targetcount": 977468301,
    "targets": [
        {
            "base64img": null,
            "dissimilarity": 0.20270270270270271,
            "targetid": "MOLECULE-043953590",
            "targetimageurl": "rest/molecules/gdb-13/43953590/png-or-placeholder?w=100&h=100",
            "targetindex": 43953590,
            "targetmolurl": "rest/molecules/gdb-13/43953590"
        },
        …
    ]
}
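
The `searchtime` and `targetcount` fields allow a quick throughput check. The sketch below assumes `searchtime` is reported in milliseconds; the slides do not state the unit, but it is consistent with the "~5 s / query" figure. The function name is illustrative.

```python
def targets_per_second(targetcount, searchtime_ms):
    """Throughput implied by a response, assuming `searchtime` is in milliseconds."""
    return targetcount / (searchtime_ms / 1000.0)


# Values copied from the two JSON responses in this deck.
local = targets_per_second(249081, 16)        # NCI 250k example on localhost
gdb13 = targets_per_second(977468301, 3917)   # GDB-13 on the r3.8xlarge server

print(f"local: {local / 1e6:.1f} M targets / s")
print(f"GDB-13: {gdb13 / 1e6:.1f} M targets / s")
```

Under that assumption the GDB-13 response works out to roughly 250 M targets / s, the same order as the 200 M targets / s quoted in the introduction.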

Check out our study

• Poster presented at Fragments 2017, available at https://www.chemaxon.com/library/similarity-implicated-exploration-of-the-fragment-galaxy/

• Using MadFast to search for drug analogues among 977M targets from GDB-13

• Which are assessed for parent coverage by searching the 16M structures in SureChEMBL

• After 4 h of setup time: 20 s / query (assessment of the 100 best analogues)

• Overlap visualization concepts

What's next: roadmap, development directions

Roadmap

• Overlap analysis visualization

• Clustering

- Real time clustering

- Similarity-based hierarchical clustering

- On all interfaces

• Query remote DB using JDBC

• Single desktop UI release.

• Public Java API components for developers

Feedback welcome

• What are your use cases? What are your pain points? Distribution, deployments, platforms, interfaces

• The proposed roadmap: feedback on priorities, functionality, what's missing

• Syncing a remote DB over JDBC: requirements, data sizes, update frequency, update patterns

• Interactive clustering: typical set sizes, method preferences, workflows

• MadFast Substructure Search: query semantics, use cases, requirements

Further resources

Product page

From https://chemaxon.com follow

Products ⇒ Discovery Toolkit ⇒ MadFast ...

Contains

- Introduction, overview

- Links to download, documentation

- Link to online demo


Documentation

Detailed documentation, including

- Step-by-step getting started guide.

- Walkthrough of typical use cases

- Advanced topics

- Java/REST API docs

Also available in the downloaded distribution.

https://disco.chemaxon.com/products/madfast/latest/

Online demo

One of the examples from the distribution is available online at https://disco.chemaxon.com/madfast-demo

Questions

Please help us during Q&A

By answering three questionnaires regarding

• The distribution

• The interfaces

• The roadmap

The roadmap

We would be interested in

• Similarity based overlap visualization

• Clustering provided with CLI/REST API

• Real time clustering

• Similarity-based diverse selection

• Syncing to existing databases over JDBC

Please tell us about the typical set sizes you prefer in the chat.

The interfaces

• Would like to use a Java client library to connect to the MadFast REST API

• Would like to use the Java API to embed MadFast

• Web UI would be needed to access all functionalities

• Full featured desktop GUI would be needed to access all functionalities

• Authentication/authorization would be needed on Web UI / REST API

The distribution

• Would need Windows .zip distribution / .bat starter scripts

• Would need Linux installer

• Would need Windows installer

• Would need MacOS installer

THANK YOU
