MadFast Similarity Search
Gábor Imre
Agenda
• Short introduction to MadFast
• Demo:
– Getting started
– Using Web UI / REST API
– Using the command line
– Searching large datasets
• Roadmap
• Q&A
Introduction: a short one
How fast?
• Very.
• Some numbers (EC2 c3.8xlarge / r3.8xlarge machine):
– Search for a few tens of most similar structures for a single query: 200 M targets / s
– Prepare a single fingerprint: 1 M targets / min
– Read prepared data into memory: 1 M targets / s
– Memory use: ~250 MB per M targets (1024 bit fp)
• What does it mean?
– Real-time similarity search of tens of millions of structures, even on a desktop
– Or handle even 1B structures on an r3.8xlarge instance
– Or provide near-real-time search of 1B structures (~5 s / query)
– Or do an exhaustive 1M x 1M similarity search in 30 min
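The figures above can be cross-checked with a little arithmetic. The sketch below only uses the numbers quoted on the slide; the function names are made up for illustration, and 1 GB is taken as 1000 MB. Note that the 1M x 1M figure implies a higher comparison rate than the single-query rate, which is an inference from the quoted numbers rather than a stated benchmark.

```python
# Back-of-envelope estimates based only on the figures quoted on the slide.
SEARCH_RATE_PER_S = 200e6        # targets compared per second, single query
MEMORY_MB_PER_M_TARGETS = 250    # approx. MB per million targets (1024 bit fp)


def query_seconds(n_targets: float) -> float:
    """Time to scan n_targets for one query at the quoted search rate."""
    return n_targets / SEARCH_RATE_PER_S


def memory_gb(n_targets: float) -> float:
    """Approximate memory footprint in GB (assuming 1 GB = 1000 MB)."""
    return n_targets / 1e6 * MEMORY_MB_PER_M_TARGETS / 1000


def implied_rate_per_s(n_queries: float, n_targets: float, seconds: float) -> float:
    """Comparisons per second implied by an exhaustive search of a given duration."""
    return n_queries * n_targets / seconds


if __name__ == "__main__":
    print(f"1B targets, one query: {query_seconds(1e9):.1f} s")
    print(f"1B targets, memory:    {memory_gb(1e9):.0f} GB")
    rate = implied_rate_per_s(1e6, 1e6, 30 * 60)
    print(f"1M x 1M in 30 min:     {rate:.2e} comparisons / s")
```

At the quoted rate, one query over 1B targets takes about 5 s, which matches the "~5 s / query" bullet.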
What is MadFast Similarity Search?
• A young product, released last December
• Engine for fast similarity searching with efficient in-memory storage
– Binary and float vector descriptors with various metrics
• Fast descriptor (fingerprint) calculation is also provided
– CFP, ECFP, MACCS-166 included; externally calculated fingerprints can also be used
• Collection of stand-alone tools implemented in Java
• Providing CLI, Web UI and REST API interfaces
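To make "binary descriptors with various metrics" concrete, here is a minimal sketch of one widely used metric, the Tanimoto dissimilarity, over bit-vector fingerprints. The slides do not say which metric the examples use, so Tanimoto is an assumption here. The linear scan is a naive illustration of a most-similar search, not MadFast's implementation, and all names are hypothetical.

```python
def tanimoto_dissimilarity(fp_a: int, fp_b: int) -> float:
    """1 - |A and B| / |A or B| for fingerprints stored as Python ints."""
    union = bin(fp_a | fp_b).count("1")
    if union == 0:
        return 0.0  # two empty fingerprints are treated as identical here
    common = bin(fp_a & fp_b).count("1")
    return 1.0 - common / union


def most_similar(query: int, targets: list, count: int) -> list:
    """Return (dissimilarity, target index) pairs for the `count` closest targets."""
    scored = [(tanimoto_dissimilarity(query, fp), i) for i, fp in enumerate(targets)]
    return sorted(scored)[:count]


if __name__ == "__main__":
    # Toy 6-bit fingerprints; real fingerprints would be e.g. 1024 bits long.
    targets = [0b001111, 0b000111, 0b001000, 0b110000]
    for dissimilarity, index in most_similar(0b001111, targets, 2):
        print(index, dissimilarity)
```

A dissimilarity of 0.0 means the two fingerprints are identical, and 1.0 means they share no set bits.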
Speed of typical tasks
Getting started is simple
On the product page
At https://www.chemaxon.com/products/madfast
Follow the “Download MadFast” link
To the download page
Where a .tar distribution is available.
You will need Linux or Windows + Cygwin to use it.
You will need
Oracle Java 1.8 installed
Cygwin installed (on Windows)
ChemAxon License file
- free evaluation: [email protected]
- copy to ~/.chemaxon/license.cxl or c:\Users\<username>\chemaxon\license.cxl
And the .tar file unpacked. We continue with Windows + Cygwin from here on.
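The prerequisites can be checked before the first run with a small script. This is a minimal sketch and not part of the distribution: the function names are hypothetical, it only tests whether a `java` executable is on the PATH (it does not verify the 1.8 version), and it looks for the license file in the two locations named above, relative to a given home directory. Under Cygwin the home directory may differ from c:\Users\<username>, so the path may need adjusting.

```python
import shutil
from pathlib import Path


def license_candidates(home: Path) -> list:
    """License locations named on the slide, relative to a home directory."""
    return [
        home / ".chemaxon" / "license.cxl",   # ~/.chemaxon/license.cxl
        home / "chemaxon" / "license.cxl",    # c:\Users\<username>\chemaxon\license.cxl
    ]


def check_prerequisites(home: Path) -> dict:
    """Report whether java is on PATH and whether a license file exists."""
    return {
        "java_on_path": shutil.which("java") is not None,
        "license_found": any(p.is_file() for p in license_candidates(home)),
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites(Path.home()).items():
        print(f"{name}: {'ok' if ok else 'MISSING'}")
```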
The Web UI and the REST API
No further installation required.
• Run a command-line tool to check that Java runs correctly: bin/searchStorage.sh -h
• Launch a self-contained example: examples/rest-api-example.sh
and connect to the launched embedded server at http://localhost:8085/
• We will explore the Web UI using a more meaningful dataset. Launch examples/rest-api-small.sh
then connect to http://localhost:8085/
Under the hood
Focused chemical space exploration
Or query through the REST API
curl \
-X POST \
-d "count=4" \
--data-urlencode "query=C([C@@H]([C@@H]1C(=C(C(=O)O1)O)O)O)O" \
-g \
"http://localhost:8085/rest/descriptors/nci-250k-cfp7/find-most-similars" | python -m json.tool
{
"query": "C([C@@H]([C@@H]1C(=C(C(=O)O1)O)O)O)O",
"querysmi": "OC[C@H](O)[C@H]1OC(=O)C(O)=C1O",
"searchtime": 16,
"targetcount": 249081,
"targets": [
{
"base64img": null,
"dissimilarity": 0.0,
"targetid": "NCI8117",
"targetimageurl": "rest/molecules/nci-250k/7975/png-or-placeholder?w=100&h=100",
"targetindex": 7975,
"targetmolurl": "rest/molecules/nci-250k/7975"
},
…
]
}
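The same request can be issued from a script. The sketch below mirrors the curl call using only the Python standard library and pulls the target IDs and dissimilarities out of the JSON response. The endpoint path, form fields and response keys are taken from the example above; the function names are hypothetical, and nothing beyond what the example shows is assumed about the API.

```python
import json
import urllib.parse
import urllib.request


def build_request(base_url: str, descriptor: str, smiles: str, count: int) -> urllib.request.Request:
    """Build the same form-encoded POST that the curl command sends."""
    url = f"{base_url}/rest/descriptors/{descriptor}/find-most-similars"
    body = urllib.parse.urlencode({"count": count, "query": smiles}).encode("ascii")
    return urllib.request.Request(url, data=body, method="POST")


def parse_hits(response_text: str) -> list:
    """Extract (target id, dissimilarity) pairs from a response body."""
    doc = json.loads(response_text)
    return [(t["targetid"], t["dissimilarity"]) for t in doc["targets"]]


def find_most_similars(base_url: str, descriptor: str, smiles: str, count: int) -> list:
    """Send the query to a running server and return the parsed hits."""
    request = build_request(base_url, descriptor, smiles, count)
    with urllib.request.urlopen(request) as response:
        return parse_hits(response.read().decode("utf-8"))
```

With the server from examples/rest-api-small.sh running, `find_most_similars("http://localhost:8085", "nci-250k-cfp7", "C([C@@H]([C@@H]1C(=C(C(=O)O1)O)O)O)O", 4)` should return the four closest targets.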
The command line
Multi query search
gzip -dc data/molecules/drugbank/drugbank-common_name.smi.gz | \
bin/searchStorage.sh \
-tmf - \
-qmf data/molecules/vitamins/vitamins.smi \
-context createSimpleCfp7Context
Query Target Dissimilarity
0 54 0.0
1 409 0.14814814814814814
2 6031 0.02631578947368421
3 44 0.0
4 32 0.0
5 513 0.0
...
More hits, IDs, formatting, out file
gzip -dc data/molecules/drugbank/drugbank-common_name.smi.gz | \
bin/searchStorage.sh \
-tmf - \
-qmf data/molecules/vitamins/vitamins.smi \
-context createSimpleCfp7Context \
-mode MOSTSIMILARS -count 3 \
-qidname -tidname -out-numeric-format "%.3f" -out res.txt
cat res.txt
Query Target Dissimilarity
Vitamin A - Retinol Vitamin A 0.000
Vitamin A - Retinol Alitretinoin 0.148
Vitamin A - Retinol Tretinoin 0.148
Vitamin A - Retinal Alitretinoin 0.148
Vitamin A - Retinal Tretinoin 0.148
Vitamin A - Retinal Isotretinoin 0.148
Vitamin A - beta-Carotene 1,3,3-trimethyl-2-[(1E,3E)-3-methylpenta-1,3-dien-1-yl]cyclohexene 0.026
Vitamin A - beta-Carotene Vitamin A 0.174
Vitamin A - beta-Carotene (6e)-6-[(2e,4e,6e)-3,7-Dimethylnona-2,4,6,8-Tetraenylidene]-1,5,5-Trimethylcyclohexene 0.200
Vitamin B1 - Thiamine Thiamine 0.000
Vitamin B1 - Thiamine Thiamin Phosphate 0.158
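A result file like this is easy to post-process. The sketch below groups the hits by query name. It assumes the columns are tab-separated with the header row shown above, which the listing suggests (the names contain spaces) but the slides do not state; the function name is hypothetical.

```python
import csv
import io
from collections import defaultdict


def hits_by_query(text: str) -> dict:
    """Group (target, dissimilarity) pairs by query name, best hit first."""
    grouped = defaultdict(list)
    # Assumption: tab-separated columns named Query, Target and Dissimilarity.
    reader = csv.DictReader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        grouped[row["Query"]].append((row["Target"], float(row["Dissimilarity"])))
    for hits in grouped.values():
        hits.sort(key=lambda hit: hit[1])  # smallest dissimilarity first
    return dict(grouped)
```

For example, `hits_by_query(open("res.txt").read())["Vitamin B1 - Thiamine"]` would give the hits for thiamine, closest first.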
Heatmap
bin/searchStorage.sh \
-context createSimpleCfp7Context \
-qmf data/molecules/vitamins/vitamins.smi \
-qidname \
-tmf data/molecules/vitamins/vitamins.smi \
-tidname \
-mode FULLMATRIX \
-out vitamins-fullmatrix.txt \
-heatmap-image vitamins-fullmatrix.png \
-heatmap-image-cellsize 15 \
-heatmap-image-query-ids-length 250 \
-heatmap-image-target-ids-length 250
Further inputs for search
Search ~1B structures
A server with GDB-13 launched
• An r3.8xlarge instance on Amazon EC2 is running during this webinar
– 32 vCPUs, 244 GiB RAM, currently for $2.66 / hour
• Near real time search: http://ec2-54-74-38-126.eu-west-1.compute.amazonaws.com:8081/simsearch.html?ref=rest/descriptors/gdb-13-cfp7&dist=hide
• Plus SureChEMBL: http://ec2-54-74-38-126.eu-west-1.compute.amazonaws.com:8081/simsearch.html?ref=rest/descriptors/surechembl-cfp7/
REST API is also available
curl \
-X POST \
-d "count=100" \
--data-urlencode "query=C([C@@H]([C@@H]1C(=C(C(=O)O1)O)O)O)O" \
-g \
http://ec2-54-74-38-126.eu-west-1.compute.amazonaws.com:8081/rest/descriptors/gdb-13-cfp7/find-most-similars | python -m json.tool
{
"query": "C([C@@H]([C@@H]1C(=C(C(=O)O1)O)O)O)O",
"querysmi": "OC[C@H](O)[C@H]1OC(=O)C(O)=C1O",
"searchtime": 3917,
"targetcount": 977468301,
"targets": [
{
"base64img": null,
"dissimilarity": 0.20270270270270271,
"targetid": "MOLECULE-043953590",
"targetimageurl": "rest/molecules/gdb-13/43953590/png-or-placeholder?w=100&h=100",
"targetindex": 43953590,
"targetmolurl": "rest/molecules/gdb-13/43953590"
},
…
]
}
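The `searchtime` and `targetcount` fields give a quick way to estimate the effective scan rate of a query. The sketch below assumes `searchtime` is reported in milliseconds; the slides do not state the unit, but it is consistent with the "~5 s / query" figure quoted earlier for about 1B structures. The function name is hypothetical.

```python
def targets_per_second(target_count: int, search_time_ms: float) -> float:
    """Effective scan rate, assuming searchtime is given in milliseconds."""
    return target_count / (search_time_ms / 1000.0)


if __name__ == "__main__":
    # Values copied from the two example responses in this deck.
    for label, count, millis in [("nci-250k", 249081, 16), ("gdb-13", 977468301, 3917)]:
        rate = targets_per_second(count, millis)
        print(f"{label}: {rate / 1e6:.1f} M targets / s")
```

Under that assumption the GDB-13 query scans roughly 250 M targets / s, in line with the quoted 200 M targets / s. The small dataset shows a much lower rate, which may be because fixed per-query overhead dominates a 16 ms search.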
Check out our study
• Poster presented at Fragments 2017, available at
https://www.chemaxon.com/library/similarity-implicated-exploration-of-the-fragment-galaxy/
• Using MadFast to search for drug analogues among 977M targets from GDB-13
• Which are assessed for parent coverage by searching the 16M structures in SureChEMBL
• After 4 h of setup time: 20 s / query (assessment of the 100 best analogues)
• Overlap visualization concepts
What's next: roadmap, development directions
Roadmap
• Overlap analysis visualization
• Clustering
- Real time clustering
- Similarity based hierarchical clustering
- On all interfaces
• Query remote DB using JDBC
• Single desktop UI release.
• Public Java API components for developers
Feedback welcome
• What are your use cases? What are your pain points?
– Distribution, deployments, platforms, interfaces
• The proposed roadmap
– Feedback on priorities, functionalities, what's missing
• Syncing remote DB over JDBC
– Requirements, data sizes, update frequency, update patterns
• Interactive clustering
– Typical set sizes, method preferences, workflows
• MadFast Substructure Search
– Query semantics, use cases, requirements
Further resources
From https://chemaxon.com follow
Products ⇒ Discovery Toolkit ⇒ MadFast ...
Contains
- Introduction, overview
- Links to download, documentation
- Link to online demo
https://www.chemaxon.com/products/madfast
Product page
Documentation
Detailed documentation, including
- Step-by-step getting started guide.
- Walkthrough of typical use cases
- Advanced topics
- Java/REST API docs
Also available in the downloaded distribution.
https://disco.chemaxon.com/products/madfast/latest/
Online demo
One of the examples from the distribution is available online at https://disco.chemaxon.com/madfast-demo
Questions
Please help us during Q&A
By answering three questionnaires regarding
• The distribution
• The interfaces
• The roadmap
The roadmap
We would be interested in
• Similarity based overlap visualization
• Clustering provided with CLI/REST API
• Real time clustering
• Similarity based diverse selection
• Syncing to existing databases over JDBC
Please tell us about the typical set sizes you prefer in the chat.
The interfaces
• Would like to use a Java client library to connect to the MadFast REST API
• Would like to use the Java API to embed MadFast
• Web UI would be needed to access all functionalities
• Full featured desktop GUI would be needed to access all functionalities
• Authentication/authorization would be needed on Web UI / REST API
The distribution
• Would need Windows .zip distribution / .bat starter scripts
• Would need Linux installer
• Would need Windows installer
• Would need macOS installer
THANK YOU