Learning Molecular Fingerprints from the Graph Up
David Duvenaud, Dougal Maclaurin,
Jorge Aguilera-Iparraguirre, Rafael Gómez-Bombarelli,
Timothy Hirzel, Alán Aspuru-Guzik, Ryan P. Adams
Harvard University
September 30, 2015
Motivation
• Want to do regression on molecules
• For virtual screening of drugs, materials, etc.
• Problem: Molecules can be any size and shape
• Only know how to learn from fixed-size examples.
• How to take a molecule in and produce a fixed-size vector?
Circular Fingerprints
• Standard method lists all substructures below a certain size
• Can do this by combining the hash of each atom with those of its bonded neighbors
• Hash value indexes into a fixed-size vector
• Problem: can’t optimize with gradients
What would Ryan do?
• Maybe we can build a message-passing network
• The same function is applied to each node (atom) and its neighbors
• Like a convolutional net
• At the top, add all nodes’ vectors together
• If we use a softmax, this generalizes circular fingerprints
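The softmax bullet can be illustrated numerically. The sketch below is not from the slides: the score vector and the scale factor of 100 are made-up values, chosen only to show that a softmax over sharply scaled scores approaches the one-hot vector that a hash-then-index step would write.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

# Hypothetical pre-softmax scores for one atom (illustrative values only).
scores = np.array([0.2, 1.0, 0.5, -0.3])

soft = softmax(scores)            # smooth: mass spread over several indices
sharp = softmax(100.0 * scores)   # large scale: nearly all mass on one index

# What a hard "write 1 at index" step would produce for the same atom.
one_hot = np.zeros_like(scores)
one_hot[np.argmax(scores)] = 1.0

print(np.round(soft, 3))
print(np.allclose(sharp, one_hot, atol=1e-6))
```

With small scores the update stays smooth and differentiable; as the scores grow it behaves more and more like writing a single 1 into the fingerprint.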
Continuous-izing Circular Fingerprints
Circular fingerprints
1: Input: molecule, radius R, fingerprint length S
2: Initialize: fingerprint vector f ← 0_S
3: for each atom a in molecule do
4:     r_a ← g(a)                      ▷ lookup atom features
5: for L = 1 to R do                   ▷ for each layer
6:     for each atom a in molecule do
7:         r_1 … r_N = neighbors(a)
8:         v ← [r_a, r_1, …, r_N]      ▷ concatenate
9:         r_a ← hash(v)               ▷ hash function
10:        i ← mod(r_a, S)             ▷ convert to index
11:        f_i ← 1                     ▷ write 1 at index
12: Return: binary vector f
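The circular fingerprint listing can be sketched in plain Python. This is a minimal illustration under several assumptions that are not in the slides: atoms are described by a single integer (here the atomic number), the molecule is given as a hypothetical adjacency list, SHA-256 stands in for the hash function, neighbor identifiers are sorted so the result does not depend on neighbor order, and all atoms are updated synchronously within a layer.

```python
import hashlib

def stable_hash(values):
    """Deterministic integer hash of a sequence of integers (SHA-256 based)."""
    data = ",".join(str(v) for v in values).encode()
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

def circular_fingerprint(atom_features, neighbors, radius, size):
    """Binary fingerprint of length `size`, following steps 2 to 12 of the listing."""
    f = [0] * size                                        # step 2
    r = [stable_hash((feat,)) for feat in atom_features]  # steps 3-4
    for _ in range(radius):                               # step 5
        new_r = []
        for a in range(len(r)):                           # step 6
            # steps 7-8: concatenate own identifier with sorted neighbor identifiers
            v = [r[a]] + sorted(r[n] for n in neighbors[a])
            h = stable_hash(v)                            # step 9
            new_r.append(h)
            f[h % size] = 1                               # steps 10-11
        r = new_r                                         # synchronous update (assumption)
    return f

# Toy molecule: heavy atoms of ethanol, C-C-O, as atomic numbers.
atom_features = [6, 6, 8]
neighbors = [[1], [0, 2], [1]]
fp = circular_fingerprint(atom_features, neighbors, radius=2, size=16)
print(fp)
```

Because every step is a hash or an index write, nothing here has a useful gradient, which is the problem the neural variant addresses.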
Neural graph fingerprints
1: Input: molecule, radius R, weights H_1^1 … H_R^5, output weights W_1 … W_R
2: Initialize: fingerprint vector f ← 0_S
3: for each atom a in molecule do
4:     r_a ← g(a)                      ▷ lookup atom features
5: for L = 1 to R do                   ▷ for each layer
6:     for each atom a in molecule do
7:         r_1 … r_N = neighbors(a)
8:         v ← r_a + ∑_{i=1}^{N} r_i   ▷ sum
9:         r_a ← σ(v H_L^N)            ▷ smooth function
10:        i ← softmax(r_a W_L)        ▷ sparsify
11:        f ← f + i                   ▷ add to fingerprint
12: Return: real-valued vector f
Every non-differentiable operation is replaced with a differentiable analog.
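A forward pass of the neural graph fingerprint listing can be sketched in NumPy. The details below are assumptions made for illustration: tanh is used as the smooth function σ, the weights are random rather than learned, every atom is assumed to have between 1 and 5 neighbors so that `H[layer, N - 1]` selects H_L^N, and atoms are updated synchronously within a layer. The array names and shapes are hypothetical.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def neural_fingerprint(atom_features, neighbors, H, W):
    """H has shape (R, 5, d, d); W has shape (R, d, S). Returns a length-S vector."""
    R, S = W.shape[0], W.shape[2]
    f = np.zeros(S)                                       # step 2
    r = np.asarray(atom_features, dtype=float)            # steps 3-4
    for layer in range(R):                                # step 5
        new_r = np.empty_like(r)
        for a, nbrs in enumerate(neighbors):              # step 6
            v = r[a] + r[nbrs].sum(axis=0)                # steps 7-8: sum, not concatenate
            new_r[a] = np.tanh(v @ H[layer, len(nbrs) - 1])  # step 9
            f += softmax(new_r[a] @ W[layer])             # steps 10-11
        r = new_r                                         # synchronous update (assumption)
    return f                                              # step 12

rng = np.random.default_rng(0)
R, d, S = 2, 4, 8
atom_features = rng.normal(size=(3, d))    # toy 3-atom chain with random features
neighbors = [[1], [0, 2], [1]]
H = rng.normal(size=(R, 5, d, d))
W = rng.normal(size=(R, d, S))
fp = neural_fingerprint(atom_features, neighbors, H, W)
print(np.round(fp, 3))
```

Each atom adds one softmax vector per layer, so the entries of the fingerprint sum to (number of layers) × (number of atoms), and every operation is differentiable with respect to `H` and `W`.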
Generalizing Circular Fingerprints
• If we generalize existing fingerprints, we can’t not win (unless we overfit)
• Large random weights make neural nets act like hash functions
• Looked at similarities between pairwise distances.
[Figure: Neural vs Circular distances, r = 0.823. Horizontal axis: circular fingerprint distances (0.5 to 1.0); vertical axis: neural fingerprint distances (0.5 to 1.0).]
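A comparison of pairwise distances of this kind can be sketched as follows. The slides do not say which distance was used; this sketch assumes a continuous generalization of the Tanimoto distance, 1 − Σ min(x, y) / Σ max(x, y). The two fingerprint sets are synthetic stand-ins, not real fingerprints, so the correlation printed here says nothing about the reported r.

```python
import numpy as np

def tanimoto_distance(x, y):
    """Continuous Tanimoto distance: 1 - sum(min) / sum(max)."""
    return 1.0 - np.minimum(x, y).sum() / np.maximum(x, y).sum()

def pairwise_distances(fps):
    """Distances for every unordered pair of rows in `fps`."""
    n = len(fps)
    return np.array([tanimoto_distance(fps[i], fps[j])
                     for i in range(n) for j in range(i + 1, n)])

rng = np.random.default_rng(0)
fps_a = rng.random((20, 32))                  # synthetic stand-in for circular fingerprints
fps_b = fps_a + 0.1 * rng.random((20, 32))    # synthetic stand-in for neural fingerprints

d_a = pairwise_distances(fps_a)
d_b = pairwise_distances(fps_b)

# Pearson correlation between the two sets of pairwise distances.
r = np.corrcoef(d_a, d_b)[0, 1]
print(round(float(r), 3))
```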
• Looked at performance of random weights.
[Figure: RMSE (log Mol/L), 0.8 to 2.0, against fingerprint radius, 0 to 6, for three methods: circular fingerprints, random conv with large parameters, and random conv with small parameters.]
Performance
Dataset                        Solubility     Drug efficacy   Photovoltaic efficiency
Units                          log Mol/L      EC50 in nM      percent
Predict mean                   4.29 ± 0.40    1.47 ± 0.07     6.40 ± 0.09
Circular FPs + linear layer    1.84 ± 0.08    1.13 ± 0.03     2.62 ± 0.07
Circular FPs + neural net      1.40 ± 0.15    1.24 ± 0.03     2.04 ± 0.07
Neural FPs + linear layer      0.74 ± 0.09    1.16 ± 0.03     2.71 ± 0.13
Neural FPs + neural net        0.53 ± 0.07    1.17 ± 0.03     1.44 ± 0.11
• Could also try varying depth of neural net on top (used one hidden layer here)
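The "predict mean" and "linear layer" rows correspond to a simple evaluation recipe, sketched below on synthetic data. Nothing here reproduces the reported numbers: the feature matrix, targets, train/test split, and the use of ordinary least squares are all assumptions made to show how a baseline RMSE and a linear-layer RMSE would be computed.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root-mean-square error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))                  # synthetic fixed-size "fingerprints"
w_true = rng.normal(size=16)
y = X @ w_true + 0.1 * rng.normal(size=200)     # synthetic regression targets

X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

# Baseline: always predict the mean of the training targets.
baseline = rmse(y_test, np.full_like(y_test, y_train.mean()))

# Linear layer on top of the fingerprints, fitted by least squares with a bias column.
A_train = np.hstack([X_train, np.ones((len(X_train), 1))])
A_test = np.hstack([X_test, np.ones((len(X_test), 1))])
coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
linear = rmse(y_test, A_test @ coef)

print(f"predict mean: {baseline:.2f}, linear layer: {linear:.2f}")
```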
Interpretability
• Circular fingerprints activate for a single substructure
• No generalization
• No notion of similarity
• Let’s put a linear layer on top of neural fingerprints and examine which fragments activate the most predictive features.
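One way to carry out the last bullet is sketched below. All values are hypothetical: `contributions` stands for the softmax vector that each fragment (one atom at one layer) adds to the fingerprint, and `linear_weights` stands for the fitted weights of the linear layer. The sketch picks the feature with the largest weight and ranks fragments by how strongly they activate it.

```python
import numpy as np

rng = np.random.default_rng(0)
num_fragments, S = 10, 6

# Hypothetical per-fragment contributions; each row sums to 1, like a softmax output.
contributions = rng.dirichlet(np.ones(S), size=num_fragments)

# Hypothetical weights of the linear layer on top of the fingerprint.
linear_weights = np.array([0.1, -0.4, 2.0, 0.3, -1.5, 0.0])

top_feature = int(np.argmax(linear_weights))      # most positively predictive feature
bottom_feature = int(np.argmin(linear_weights))   # most negatively predictive feature

# Indices of the three fragments that activate the top feature most strongly.
top_fragments = np.argsort(-contributions[:, top_feature])[:3]
print(top_feature, bottom_feature, top_fragments)
```

Because each fragment's contribution is a smooth vector rather than a single hashed bit, similar fragments can activate the same feature to different degrees.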
Interpretability: Solubility
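Fragments activating the feature most predictive of solubility:
[Figure: molecule drawings; the atom labels that survive extraction are O, OH, and NH.]
Fragments activating the feature most predictive of insolubility: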
Interpretability: Toxicity
Fragments most activated by the toxicity feature on the SR-MMP dataset:
Fragments most activated by the toxicity feature on the NR-AHR dataset:
Future Work
• Limitation: Slow because of so many weight transforms
• Could use low-rank weight matrices
• Limitation: All features are local
• Could learn to “parse” molecules
• But how to take gradients?
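The low-rank suggestion can be made concrete with a small parameter count. The width `d` and rank `k` below are hypothetical; the sketch only shows that factoring a d × d weight matrix as U V gives the same transform with fewer parameters and two thin matrix products.

```python
import numpy as np

d, k = 64, 8    # hypothetical feature width and rank
rng = np.random.default_rng(0)
U = rng.normal(size=(d, k))
V = rng.normal(size=(k, d))

v = rng.normal(size=d)      # one atom's summed feature vector
full = v @ (U @ V)          # apply the dense d x d matrix
fast = (v @ U) @ V          # same result via two thin products

params_full = d * d             # parameters in an unconstrained matrix
params_low_rank = 2 * d * k     # parameters in the factored form

print(np.allclose(full, fast), params_full, params_low_rank)
```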
Delaney, John S. ESOL: Estimating aqueous solubility directly from molecular structure. Journal of Chemical Information and Computer Sciences, 44(3):1000–1005, 2004.
Gamo, Francisco-Javier, Sanz, Laura M, Vidal, Jaume, de Cozar, Cristina, Alvarez, Emilio, Lavandera, Jose-Luis, Vanderwall, Dana E, Green, Darren VS, Kumar, Vinod, Hasan, Samiul, et al. Thousands of chemical starting points for antimalarial lead identification. Nature, 465(7296):305–310, 2010.
Hachmann, Johannes, Olivares-Amaya, Roberto, Atahan-Evrenk, Sule, Amador-Bedolla, Carlos, Sánchez-Carrera, Roel S, Gold-Parker, Aryeh, Vogt, Leslie, Brockway, Anna M, and Aspuru-Guzik, Alán. The Harvard clean energy project: large-scale computational screening and design of organic photovoltaics on the world community grid. The Journal of Physical Chemistry Letters, 2(17):2241–2251, 2011.
Tox21 Challenge. National center for advancing translational sciences. http://tripod.nih.gov/tox21/challenge, 2014. [Online; accessed 2-June-2015].