Supporting Information
A Universal Deep Learning Framework based on Graph Neural Network for Virtual Co-Crystal Screening
Yuanyuan Jiang a, Jiali Guo a, Yijing Liu b, Yanzhi Guo a,
Menglong Lia, Xuemei Pu a,*
a College of Chemistry, Sichuan University, Chengdu, 610064
b College of Computer Science, Sichuan University, Chengdu, 610064
* Corresponding Author
Xuemei Pu ([email protected])
1. Construction of several machine learning models as the controls
The DNN constructed in our work contains six fully connected layers, as shown in Figure S1. Except for the final output layer, batch normalization1 and ReLU2 are applied in each layer.
Figure S1. Architecture of DNN.
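As a hedged illustration of this stack, the forward pass can be sketched in plain NumPy; the layer widths below are hypothetical (they are not listed here), the weights are random, and the batch normalization is inference-style without learned scale and shift:

```python
import numpy as np

def dense(x, w, b):
    # Fully connected layer: x @ W + b
    return x @ w + b

def batch_norm(x, eps=1e-5):
    # Normalize each feature over the batch (no learned scale/shift for brevity)
    mu = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def relu(x):
    return np.maximum(x, 0.0)

def dnn_forward(x, weights):
    # weights: list of (W, b) pairs for the six fully connected layers.
    # Batch normalization + ReLU follow every layer except the final output layer.
    for w, b in weights[:-1]:
        x = relu(batch_norm(dense(x, w, b)))
    w, b = weights[-1]
    return dense(x, w, b)  # raw logits for the classifier

rng = np.random.default_rng(0)
dims = [32, 64, 64, 32, 16, 8, 2]  # hypothetical layer widths
weights = [(rng.standard_normal((i, o)) * 0.1, np.zeros(o))
           for i, o in zip(dims[:-1], dims[1:])]
logits = dnn_forward(rng.standard_normal((4, 32)), weights)
print(logits.shape)  # (4, 2)
```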
NCI1 is a spatial-based graph convolutional network from Such et al.3, whose key components are three Graph-CNN layers and two Graph Embedding Pooling (GEP) layers, as depicted in the left of Figure S2. The two kinds of layers perform the message passing phase, and the readout phase is a flattening operation. Details of Graph-CNN are given in Methods. Here we mainly introduce the GEP layer, as shown in the right of Figure S2.
Figure S2. Architecture of NCI1.
Like pooling layers in conventional CNNs, the GEP layer is used to reduce the dimensions of the input, which eliminates redundant information and also improves computational performance. GEP transforms a graph with $N$ nodes into one with a given number of nodes $N'$. For this purpose, an embedding matrix $\mathbf{X}_{emb} \in \mathbb{R}^{N \times N'}$ is produced by a filter tensor $\mathbf{H}_{emb} \in \mathbb{R}^{N \times N \times C \times N'}$. The calculation of $\mathbf{X}_{emb}$ is similar to the multiple-filter Graph-CNN (vide Methods), where the learnable filter $\mathbf{H}_{emb}$ is multiplied by the node features $\mathbf{X}_{in}$. It is defined by equations (S1)-(S2):
$\mathbf{X}_{emb}^{(n')} = \sum_{c=1}^{C} \mathbf{H}_{emb}^{(c,n')} \mathbf{X}_{in}^{(c)} + b$   (S1)

$\mathbf{X}_{emb} = \mathrm{softmax}\left(\mathrm{GConv}_{emb}(\mathbf{X}_{in}, N') + b\right)$   (S2)

where $\mathbf{H}_{emb}^{(c,n')} \in \mathbb{R}^{N \times N}$ is a slice of $\mathbf{H}_{emb}$, and $\mathbf{X}_{emb}^{(n')}$ is a column of $\mathbf{X}_{emb} \in \mathbb{R}^{N \times N'}$. The pooled graph data are then calculated by the following operations (vide equations (S3)-(S4)):
$\mathbf{X}_{out} = \mathbf{X}_{emb}^{T} \mathbf{X}_{in}$   (S3)

$\mathbf{A}_{out} = \mathbf{X}_{emb}^{T} \mathbf{A}_{in} \mathbf{X}_{emb}$   (S4)

where $\mathbf{A}_{in} \in \mathbb{R}^{N \times N}$ is the adjacency matrix, $\mathbf{A}_{out} \in \mathbb{R}^{N' \times N'}$ is the pooled adjacency matrix, and $\mathbf{X}_{out} \in \mathbb{R}^{N' \times C}$ is the pooled node feature matrix. Finally, GEP produces a pooled graph described by $\mathbf{A}_{out}$ and $\mathbf{X}_{out}$.
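As a hedged illustration of equations (S1)-(S4), the GEP operation can be sketched in NumPy with random arrays standing in for the learned filter tensor; the row-wise softmax axis is an assumption:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def gep_pool(X_in, A_in, H_emb, b):
    # X_in: (N, C) node features; A_in: (N, N) adjacency matrix
    # H_emb: (N, N, C, N') filter tensor; b: scalar bias
    N, C = X_in.shape
    Np = H_emb.shape[-1]
    # Equation S1: each column of the embedding matrix sums filter slices over channels
    X_emb = np.zeros((N, Np))
    for n_prime in range(Np):
        col = np.zeros(N)
        for c in range(C):
            col += H_emb[:, :, c, n_prime] @ X_in[:, c]
        X_emb[:, n_prime] = col + b
    X_emb = softmax(X_emb, axis=-1)   # Equation S2 (row-wise softmax assumed)
    X_out = X_emb.T @ X_in            # Equation S3: pooled node features (N', C)
    A_out = X_emb.T @ A_in @ X_emb    # Equation S4: pooled adjacency (N', N')
    return X_out, A_out

rng = np.random.default_rng(1)
N, C, Np = 6, 4, 3
X_out, A_out = gep_pool(rng.standard_normal((N, C)),
                        rng.integers(0, 2, (N, N)).astype(float),
                        rng.standard_normal((N, N, C, Np)) * 0.1, 0.0)
print(X_out.shape, A_out.shape)  # (3, 4) (3, 3)
```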
enn-s2s, proposed by Gilmer et al.4, has two phases (a message passing phase and a readout phase), as shown in Figure S3. In Gilmer's work, enn-s2s is a regression model. Here, in order to extend its application to the classification prediction of cocrystal formation, we modified the architecture of enn-s2s by changing the dimension of the output layer. The message passing phase includes two functions, i.e., a message passing function and an update function. The message passing function is used to propagate node features, as reflected by equation (S5):
$m_i^t = W h_i^{t-1} + \sum_{j \in \mathcal{N}(i)} h_j^{t-1} \cdot \mathrm{MLP}(e_{i,j})$   (S5)

where $h_i^t$ is the feature of node $i$ at the $t$-th time step, $W$ denotes trainable weights, $\mathcal{N}(i)$ is the set of nodes adjacent to node $i$, $e_{i,j}$ is the feature of the edge between nodes $i$ and $j$, and MLP is a multi-layer perceptron.
Figure S3. Architecture of enn-s2s.
The update function used to update node features is the Gated Recurrent Unit (GRU)5, as described by equation (S6):

$h_i^t = \mathrm{GRU}(h_i^{t-1}, m_i^t)$   (S6)

where $h_i^t$ is the hidden state of node $i$ at the $t$-th time step and $m_i^t$ is the message obtained from equation (S5).
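A minimal NumPy sketch of one message passing step in the spirit of equations (S5)-(S6); the one-layer edge network and the simplified GRU parameterization (biases omitted) are assumptions for illustration, not the trained implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def message_pass(H, E, nbrs, W, mlp):
    # Equation S5: m_i = W h_i + sum_j h_j * MLP(e_ij)
    # H: (N, d) node features; E[(i, j)]: edge feature vector; nbrs[i]: neighbor list
    M = H @ W.T
    for i, js in enumerate(nbrs):
        for j in js:
            M[i] += H[j] * mlp(E[(i, j)])  # elementwise gating by the edge network
    return M

def gru_update(h_prev, m, Wz, Wr, Wh):
    # Equation S6: simplified GRU cell, h_i^t = GRU(h_i^{t-1}, m_i^t)
    x = np.concatenate([h_prev, m], axis=-1)
    z = sigmoid(x @ Wz)  # update gate
    r = sigmoid(x @ Wr)  # reset gate
    h_tilde = np.tanh(np.concatenate([r * h_prev, m], axis=-1) @ Wh)
    return (1 - z) * h_prev + z * h_tilde

rng = np.random.default_rng(2)
d = 4
H = rng.standard_normal((3, d))
E = {(i, j): rng.standard_normal(2) for i in range(3) for j in range(3) if i != j}
nbrs = [[1, 2], [0], [0]]
W_msg = rng.standard_normal((d, d)) * 0.1
W_edge = rng.standard_normal((2, d)) * 0.1
mlp = lambda e: np.tanh(e @ W_edge)  # hypothetical one-layer edge network
M = message_pass(H, E, nbrs, W_msg, mlp)
Wz, Wr, Wh = (rng.standard_normal((2 * d, d)) * 0.1 for _ in range(3))
H_new = gru_update(H, M, Wz, Wr, Wh)
print(H_new.shape)  # (3, 4)
```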
For the readout phase, enn-s2s computes a feature vector for the whole graph based on the iterative content-based attention of Vinyals et al.6 (vide equations (S7)-(S10)):

$q_t = \mathrm{LSTM}(q_{t-1}^{*})$   (S7)

$\alpha_{i,t} = \dfrac{\exp(h_i \cdot q_t)}{\sum_{j \in \mathcal{G}} \exp(h_j \cdot q_t)}$   (S8)

$r_t = \sum_{i=1}^{N} \alpha_{i,t} h_i$   (S9)

$q_t^{*} = q_t \parallel r_t$   (S10)
where $i$ indexes the node feature vectors $h_i$, $q_t$ is a query vector that allows us to read $r_t$ from the memories at the $t$-th time step, $\alpha_{i,t}$ is the attention coefficient of node $i$ at the $t$-th time step, and LSTM is Long Short-Term Memory7, which computes a recurrent state. $\mathcal{G}$ is the graph to which nodes $i$ and $j$ belong, $N$ is the number of nodes in graph $\mathcal{G}$, and $\parallel$ denotes concatenation. $t$ is the step index, i.e., the number of times the state is computed; its maximum is 3 in this work. After the three steps, $q_t^{*}$ is the feature vector for the whole graph, which is fed to a classifier consisting of two dense layers.
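The readout of equations (S7)-(S10) can be sketched as follows; the minimal LSTM cell (single weight matrix, biases omitted) is a hypothetical stand-in for the trained recurrent cell:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def set2set_readout(H, steps, lstm_step):
    # H: (N, d) final node features; lstm_step: recurrent cell for q_t = LSTM(q*_{t-1})
    N, d = H.shape
    q_star = np.zeros(2 * d)            # q*_0 initialized to zeros
    state = (np.zeros(d), np.zeros(d))  # LSTM hidden and cell state
    for _ in range(steps):              # 3 steps in this work
        q, state = lstm_step(q_star, state)  # Equation S7
        alpha = softmax(H @ q)               # Equation S8: content-based attention
        r = alpha @ H                        # Equation S9: attention-weighted sum
        q_star = np.concatenate([q, r])      # Equation S10: concatenation
    return q_star  # whole-graph feature vector of length 2d

def make_lstm(d, rng):
    # Hypothetical minimal LSTM cell: one fused weight matrix, no biases
    W = rng.standard_normal((3 * d, 4 * d)) * 0.1
    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))
    def step(x, state):
        h, c = state
        gates = np.concatenate([x, h]) @ W
        i, f, o, g = np.split(gates, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        return h, (h, c)
    return step

rng = np.random.default_rng(3)
H = rng.standard_normal((5, 8))
q_star = set2set_readout(H, steps=3, lstm_step=make_lstm(8, rng))
print(q_star.shape)  # (16,)
```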
CCGNet-simple is proposed in this work in order to observe the impact of different feature integration operations. Its message passing phase consists of three Graph-CNN layers (vide Methods) and its readout function is the multi-head global attention (vide Methods) with 10 heads. After the global attention, the global state U is fused into the graph embedding.
Figure S4. Architecture of CCGNet-simple.
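Since the multi-head global attention itself is detailed in Methods, the sketch below only illustrates the general shape of a 10-head attention readout followed by fusion of a global state U; the per-head scoring parameterization and the dimension of U are assumptions:

```python
import numpy as np

def softmax(z, axis=0):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multihead_global_attention(H, W_att):
    # H: (N, d) node features after the Graph-CNN layers
    # W_att: (d, heads) per-head scoring weights (hypothetical parameterization)
    scores = softmax(H @ W_att, axis=0)  # (N, heads) attention over nodes, per head
    pooled = scores.T @ H                # (heads, d): one weighted readout per head
    return pooled.reshape(-1)            # concatenated graph embedding

rng = np.random.default_rng(4)
N, d, heads = 7, 16, 10
H = rng.standard_normal((N, d))
g = multihead_global_attention(H, rng.standard_normal((d, heads)) * 0.1)
U = rng.standard_normal(12)              # hypothetical global-state descriptor
embedding = np.concatenate([g, U])       # fuse global state into the graph embedding
print(embedding.shape)  # (172,)
```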
2. More examples for the attention visualization
Figure S5. Attention Visualization of BAFGEX.
Figure S6. Attention Visualization of VIHKUU.
Figure S9. Attention Visualization of MAQZEK.
Table S3. Solvents involved in collecting cocrystal positive samples from Cambridge
Structural Database.
Toluene 4-Chlorotoluene diglyme
DMSO-d6 1,3,5-trichlorobenzene iodobenzene
trichloromethane-d gamma-Butyrolactone 1,1,2-trichloroethane
ethoxyethane DL-sec-Butyl acetate formic acid
methylamine iodomethane dimethyl sulfoxide
p-Xylene methanamide 3-methyl-1-butanol
1-butanol Tetrahydrofuran bromobenzene
cyclohexanone chlorobenzene dimethoxymethane
1H-pyrrole Ethyl formate 2-butanone
2-butanol isobutanol N-Ethylmorpholine
1,1,2,2-tetrachloroethane N, N, N', N'-Tetramethylethylenediamine propan-2-ol
1,4-dioxane Ethanol 2-methyl-2-propanol
2-methylpyridine 3-methylpyridine 2-butoxyethanol
diethylenetriamine 2-methoxyethanol dibromomethane
1-methyl-2-pyrrolidone N, N-dimethylacetamide 2,2'-Dichlorodiethyl ether
Methyl acetate cyclopentane benzyl alcohol
benzene hexadecane water-d2
nitromethane hexamethyldisiloxane Hexane
1-Chloro-2-Methylpropane acetic anhydride propanenitrile
acetamide acetic acid Ethylene glycol
Diethylene glycol Isopropyl acetate Isopropyl ether
tetrachloromethane acetone acetophenone
nitrobenzene propionic acid 1,2-Propanediol
pentane 1,1-Dichloroethane butane-1,4-diol
1,3-dimethylbenzene 1,2-dihydrostilbene N, N-diethylethanamine
tribromomethane 2-propoxyethanol 1,2-Dichloroethane
1-propanol water phenylamine
heptane trichloromethane pyridine
cyclohexene cyclohexane Methanol
1,2-dimethoxyethane 3-pentanone fluorobenzene
epichlorohydrin acetonitrile dichloromethane
methanedithione 1-Octanol butanedioic acid
N, N-dimethylformamide 1,2-ethanediamine 2,4-pentanedione
o-Xylene Propylene glycol monomethyl ether acetate 1,3,5-trimethylbenzene
2-phenylacetonitrile 2-Chlorotoluene 1,2-dichlorobenzene
isophorone morpholine nitric acid
quinoline benzonitrile ethyl acetate
benzene-d6
Table S4. Performances of various models with different feature compositions for the
valid set of 10-fold cross validation.
Model PACC (%) NACC (%) BACC (%)
SVM 98.99 (±0.39) 87.55 (±2.72) 93.27 (±1.44)
RF 99.89 (±0.06) 91.00 (±2.70) 95.44 (±1.34)
DNN 99.53 (±0.29) 90.46 (±2.34) 95.00 (±1.07)
NCI1 99.01 (±0.50) 85.96 (±3.56) 92.49 (±1.63)
enn-s2s 98.44 (±0.45) 86.96 (±3.68) 92.70 (±1.76)
CCGNet-simple 99.46 (±0.45) 93.45 (±2.45) 96.46 (±1.05)
CCGNet 99.89 (±0.13) 96.98 (±2.20) 98.43 (±1.12)
Table S5. Refcodes of energetic cocrystals collected from CSD for the out-of-distribution prediction.
ABTNBA01 ABUNIU AJAKOL ANCTNB APANBZ
BIYXAL BIZZAO BNZTNB BZATNB20 CAZTBZ01
CBZTNB CECPEF CEZFOF DIFZOK DUKBOC
DUKBUI DUKCAP ERAFAE FETYAE FONHOH
FONJAV FUFSOQ GEXMAZ GEXMED GEXMIH
GEXMON HECREM HETTIM HETTOS HETTUY
HIVGAW HUZSEA IZUZUZ IZUZUZ01 JABYIX
JABYOD JOCTAZ KIZVAQ KOBFIQ KUMYOI
LOKJIH LUTGUD MAAZNB NIBJUF NIBZAM
NIKLOL NILCET POCVIP POSREV PUBMUU20
PUBWEO PUTWEI PUTWIM PUTWOS PUTWUY
PUTXAF PUTXEJ PVVBFD01 PVVBKP01 PYRTNB
QAPNAZ QARQUY QINLEH QOSRUN QOWBEJ
REDCIM REDCUY REDDAF REDDEJ RENPUV
RULLUF RUYKUR RUYLAY RUYLEC SERZIB
SKTNIB SOQPAQ STINBZ SUGCAY TETTAQ
TIVJUF TOZMUS UGUNAN URIHUZ URIJEL
URIJUB URILAJ USEZID VAZBIJ VIGKIF
VIGKUR VIGLEC WEPGEG WEPTAP WOJWIB
WOJWOH WOJXEY XAHZAH XAJJUQ XEMCID
XIZCER YEDVAH ZEBJOH01 ZEGKIF10 ZEVNUL
ZEZGIW ZEZHET ZILMUF ZOPGOC ZUBNOB
ZUBNUH ZZZAGS10 YOJQOG YOJXIH YOJXON
NILCIX ZEBJOH WOSFOB PEHSUS XAQFUS
ZASWAT ZASWEX ZASWIB GOWHIL ROSMOD
ROSMIX JAQVOP UWUGAW JABYIX MANLEV
BOXTET WUGWAY WIFYAN WIFXUG IDENEM
ZEZGOC ZEZHAP ZEZHOD URIJAH URIJIP
URIMAK URILOX URIKOW URILEN URIKUC
URIKIQ URIJOV URIKEM URIKAI UTEJAG
MEPWIQ FOYSUJ
References
1. Ioffe, S.; Szegedy, C., Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv preprint arXiv:1502.03167 2015.
2. Battaglia, P. W.; Hamrick, J. B.; Bapst, V.; Sanchez-Gonzalez, A.; Zambaldi, V.;
Malinowski, M.; Tacchetti, A.; Raposo, D.; Santoro, A.; Faulkner, R., Relational
inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261
2018.
3. Such, F. P.; Sah, S.; Dominguez, M. A.; Pillai, S.; Zhang, C.; Michael, A.; Cahill,
N. D.; Ptucha, R., Robust Spatial Filtering With Graph Convolutional Neural Networks.
IEEE J. Sel. Top. Signal Process. 2017, 11 (6), 884-896.
4. Gilmer, J.; Schoenholz, S. S.; Riley, P. F.; Vinyals, O.; Dahl, G. E., Neural message
passing for quantum chemistry. arXiv preprint arXiv:1704.01212 2017.
5. Cho, K.; Van Merrienboer, B.; Bahdanau, D.; Bengio, Y., On the Properties of Neural Machine Translation: Encoder-Decoder Approaches. arXiv preprint arXiv:1409.1259 2014.
6. Vinyals, O.; Bengio, S.; Kudlur, M., Order Matters: Sequence to sequence for sets. arXiv preprint arXiv:1511.06391 2015.
7. Hochreiter, S.; Schmidhuber, J., Long Short-Term Memory. Neural Computation
1997, 9 (8), 1735-1780.