Artificial intelligence in drug design: generative adversarial network for molecules generation

Master Thesis

to obtain the academic degree of

Master of Science

in the Master's Program

Bioinformatics

Submitted by: Isaac Lazzeri

Submitted at: Institute of Bioinformatics

Supervisor: Univ. Prof. Dr. Sepp Hochreiter

Co-Supervisor: Mag. Dr. Günter Klambauer

02 2018

JOHANNES KEPLER UNIVERSITY LINZ
Altenbergerstraße 69
4040 Linz, Österreich
www.jku.at
DVR 0093696


Acknowledgment

I would like to express my deep gratitude to Dr. Günter Klambauer and Pro-

fessor Dr. Sepp Hochreiter, for their patient guidance and useful critiques

of this master thesis. I would also like to thank all people working at the

Bioinformatics department, for their constructive recommendations and

help.

Finally, I wish to thank my family and Sandra for their support and encouragement throughout my studies.


Contents

Abstract
Zusammenfassung
1 Introduction
  1.1 Soft introduction to machine learning
  1.2 From Neural Networks to Deep Learning
    1.2.1 History of Neural Networks
    1.2.2 Artificial Neural Networks
  1.3 Generative Adversarial Network (GAN)
  1.4 Machine Learning in chemoinformatics
  1.5 Aims of the master thesis
2 Methods
  2.1 Simplified molecular-input line-entry system
  2.2 Molecular fingerprints
  2.3 ChEMBL
  2.4 Data-set preparation
  2.5 Evaluation
    2.5.1 Tanimoto coefficient
    2.5.2 Fréchet inception distance
    2.5.3 Fréchet Tox21 Distance
  2.6 Chemo-GAN
  2.7 Latent-Space-GAN
    2.7.1 Auto-encoder
    2.7.2 Generator and Discriminator
3 Results
  3.1 Results Chemo-GAN
4 Discussion
5 Conclusion

List of Figures

1.1 Example of a decision tree
1.2 Perceptron, F. Rosenblatt 1957-1958
1.3 Logistic function and Heaviside function
1.4 Structure of an artificial neuron
1.5 Forward pass for one layer
1.6 Delta propagation
1.7 Structures of recurrent neural networks
1.8 LSTM cell structure
1.9 DCGAN faces
1.10 Structure of a generative adversarial network
1.11 Workflows for QSAR and QSPR
2.1 ECFP generation process
2.2 ChEMBL
2.3 Sequence length distribution of SMILES strings
2.4 Character frequency in ChEMBL
2.5 Latent-Space-GAN
2.6 Accuracy measured during training for the auto-encoder using the linear latent space
2.7 Accuracy measured during training for the auto-encoder using the sigmoidal latent space
2.8 Distribution of the percentages of valid SMILES strings per generator
3.1 Results Chemo-GAN: FTOXD and Tanimoto coefficient
3.2 Results Chemo-GAN: FTOXD and Tanimoto coefficient per group
3.3 FTOXD measured every 500 updates for the Chemo-GAN
3.4 Learning curves of the Chemo-GAN
3.5 Learning curves for the Chemo-GAN
4.1 100 generated chemical compounds
4.2 FTOXD measured for the Latent-Space-GAN for valid generated SMILES strings sampled from priors with different SD

List of Tables

1.1 Example of a data-set
1.2 Timeline of the history of Neural Networks
2.1 Data-set for Chemo-GAN
2.2 Data-set for Latent-Space-GAN
2.3 Comparison between original and generated SMILES strings

Abstract

Introduction and background: The generation of new chemical compounds plays a key role in drug discovery, but in-silico methods based on hand-crafted rules can only cover a tiny part of the synthetically accessible chemical space. Computational methods able to automatically extract such rules from data are therefore desirable. The aim of this work is to adapt Generative Adversarial Networks (GANs) to generate novel chemical compounds.

Methods: Molecular fingerprints and canonical SMILES strings were retrieved from the ChEMBL data-set through the RDKit package and Python. A GAN was implemented using Keras and TensorFlow and trained on chemical fingerprints. This model, called Chemo-GAN, was implemented as fully connected deep neural networks. This is, to the best of our knowledge, the first time that GANs were used to generate molecular descriptors. A second model was implemented using an auto-encoder to map SMILES strings to and from a latent space, together with a GAN trained to generate this latent-space representation of SMILES strings. The aim of this second approach was to obtain a generator able to produce latent-space representations of SMILES and a decoder able to map them back to SMILES strings. This model was called Latent-Space-GAN. To evaluate the distance between the distributions of the original and the generated data and to assess sample quality, a new distance measure, called FTOXD, was designed and calculated during training, giving a way to tackle the hard problem of generative-model evaluation.

Results: Chemo-GAN successfully approximated the original data distribution, producing molecular fingerprints of high quality, as shown by the FTOXD, which decreases along the training process. Latent-Space-GAN is able to produce SMILES strings with a low FTOXD, but the percentage of valid SMILES strings is low. However, it keeps increasing along the training, suggesting that better performance could be obtained after longer training.


Zusammenfassung

Introduction and background: The design of new chemical compounds plays an important role in drug development, but in-silico methods based on hand-crafted rules can only cover a small part of the synthesizable chemical space. Computational methods that can find these rules automatically would therefore be desirable. The goal of this master thesis is to adapt Generative Adversarial Networks (GANs) for the design of new chemical compounds.

Methods: Molecular fingerprints and canonical SMILES sequences were retrieved from the ChEMBL database using the RDKit package. A GAN was implemented in Keras and TensorFlow and trained on chemical fingerprints. This model, called Chemo-GAN, was implemented as fully connected deep neural networks. To the best of the authors' knowledge, this is the first time that GANs have been used to generate molecular fingerprints. A second method was developed that uses an auto-encoder to map SMILES sequences into a latent space; a GAN was then trained to generate this latent-space representation. The aim of this second method was to obtain a generator that can produce representations which can be decoded back into SMILES. This second model was called Latent-Space-GAN. To evaluate the distance between the distribution of the original data and that of the generated data, and to assess the quality of the generated SMILES strings, a new distance and quality criterion called FTOXD was developed and evaluated.

Results: The Chemo-GAN model was able to successfully approximate the distribution of the original data and to produce molecular fingerprints of high quality, as shown by means of the FTOXD. The Latent-Space-GAN model can generate valid SMILES sequences with a low FTOXD distance, but the proportion of valid SMILES sequences is low. However, it was also shown that more extensive model selection could yield better results.

1. Introduction

In the following chapters the background needed to understand this thesis is given. Along with this introduction, a general explanation of machine learning and a detailed one concerning neural networks, Deep Learning and generative adversarial networks are provided. These are the basis on top of which the core methods of this thesis are built. Furthermore, the field of chemoinformatics is introduced, together with an explanation of the importance machine learning methods have in it.

1.1 Soft introduction to machine learning

Every day, billions of data points are generated and stored. Smart-phone applications, health-care systems, bank accounts, social networks and many more examples are just some of the systems which are continuously generating and storing data. The technological development we have been witnessing during the last decades gave rise to the new problem of how to use this huge quantity of data to generate new knowledge. In this environment machine learning arose [31].

In machine learning, data are fed into algorithms whose aim is to learn from them and solve a specific task. Examples of tasks are: object recognition, stock market trend prediction, photo caption generation, sequence-to-sequence translation, etc.

Table 1.1: Example of a data-set

            weight (x_{1:n,1})   eyes number (x_{1:n,2})   legs number (x_{1:n,3})   label (y_{1:n})
x_{1,1:m}   10                   2                         4                         dog
x_{2,1:m}   65                   2                         2                         human
x_{3,1:m}   0.85                 8                         8                         spider
x_{n,1:m}   ...                  ...                       ...                       ...

But what does it mean for an algorithm "to learn"? When we think about learning processes, we generally associate them with human or animal behavior. People learn to associate names with persons or things, and the more often they come across them, the better they can recognize them. In this case, people learn to associate features of these persons, like body size, voice tonality, hair color, etc., with their names, and these associations become stronger each time. But learning is not only associating things or memorizing them. Indeed, it sometimes happens that we come across something with specific characteristics we associated with fear, and we experience this sensation even without ever having seen that thing before. This generalization process, which brought this association to our mind, represents the real difference between memorizing and learning, between algorithms with hand-crafted rules and machine learning ones. In machine learning, algorithms "observe" samples/objects composed of vectors of features, which may have labels representing their membership classes. Samples define a design matrix X ∈ R^{n×m}, where X_{1,1:m} is equal to the first sample and X_{1:n,1} is equal to the first feature vector. The design matrix together with the labels vector constitutes a data-set. The data-set represents the "experience" algorithms have [31].
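As a small illustration of these definitions, the toy data-set of Table 1.1 can be written as a design matrix and a label vector, sketched here with NumPy (the array contents are the example values from the table, not data used later in this thesis):

```python
import numpy as np

# Design matrix X: one row per sample, one column per feature
# (weight, number of eyes, number of legs), as in Table 1.1.
X = np.array([
    [10.0, 2, 4],    # dog
    [65.0, 2, 2],    # human
    [0.85, 8, 8],    # spider
])

# Label vector y: the membership class of each sample.
y = np.array(["dog", "human", "spider"])

print(X.shape)    # (3, 3): n = 3 samples, m = 3 features
print(X[0, :])    # first sample, x_{1,1:m}
print(X[:, 0])    # first feature vector, x_{1:n,1}
```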


This "experience" is used to build a model of the perceived reality upon

which decision and conclusion concerning the task to solve are taken[31].

Machine learning involvesmanydifferentalgorithmsand techniques,which

can be generally grouped in three fields:

• Supervisedmachine learning (SML)

• Unsupervisedmachine learning (UML)

• Reinforcement learning (RL)

These three fields differ from one another in the kind of "experience" the

algorithms are allowed to experience. Indeed, while in SML algorithms

observe both features representing the samples and the labels defining the

membership classes, in UML, they are only allowed to inspect the features.

In RL instead, algorithms act in an environment carrying out actions and

observing their consequences, understanding through this trial-error ap-

proach, which actions and states are favorable [31], [5], [52].

An example of a technique used in SML is the decision tree. This method builds a tree composed of nodes, which represent rules, leaves, which represent classes or predictions, and edges, which connect nodes with nodes or nodes with leaves. A sample follows a path from the root to a leaf. The path followed and the leaf reached depend on the rules encountered at each node and on the values of the features of the sample being classified. During the training process, rules are defined by splitting the training set into smaller sets with the aim of making them more homogeneous each time. This is achieved when the splitting criterion which maximizes the information gain is found. The information gain is so defined:

IG = H(S) - \sum_{j=1}^{n} \frac{|S_j|}{|S|} H(S_j)   (1.1)

H(S) = - \sum_{i=1}^{k} p_i(S) \log(p_i(S))   (1.2)

where H is the entropy, S is the set of samples in the parent node, n is the number of categories in the feature vector, k is the number of classes and {S_j | j = 1, ..., n} are the sets obtained after applying the splitting criterion. In the case of categorical data, samples are grouped by the categories present in a specific feature vector, and the feature vector which maximizes the information gain is selected. In the case of numerical data, the splitting criterion is a value which acts as a threshold: for a specific feature vector, samples having values greater than the given threshold are grouped together, and so are those having lower values. The feature maximizing the information gain is accordingly selected [31], [5].

Figure 1.1: Example of a decision tree. The root node tests the number of legs (x = 2: yes → Human); otherwise the weight is tested (x < 1 kg: yes → Spider, no → Dog).
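As a toy illustration of equations (1.1) and (1.2), a minimal sketch of an entropy and information-gain computation is given below (the function names and the example split are illustrative, not code from this thesis):

```python
import numpy as np

def entropy(labels):
    """H(S) = -sum_i p_i(S) log2(p_i(S)), equation (1.2)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent_labels, splits):
    """IG = H(S) - sum_j |S_j|/|S| * H(S_j), equation (1.1)."""
    n = len(parent_labels)
    weighted = sum(len(s) / n * entropy(s) for s in splits)
    return entropy(parent_labels) - weighted

# Toy example: splitting the animals of Table 1.1 on "number of legs = 2".
parent = ["dog", "human", "spider"]
splits = [["human"], ["dog", "spider"]]
print(information_gain(parent, splits))   # ~0.92 bits
```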

An example of an unsupervised technique is k-means. In this method no labels are required. Each sample represents a point in a multidimensional space, where the number of dimensions is defined by the number of features a sample has. A defined number of cluster centers is initialized and each point is assigned to the closest cluster. After the assignment of all points, the position of each center is updated as the mean of the points assigned to it, where the distance measure is defined beforehand and can be, for example, the Euclidean distance or the Manhattan distance. These three phases are carried out until the center coordinates stop changing or some stopping criterion has been reached [5].
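A minimal NumPy sketch of this procedure (Euclidean distance; the function name, random initialization and the assumption that every cluster keeps at least one point are illustrative):

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize the cluster centers with k randomly chosen samples.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to the closest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center becomes the mean of its assigned points
        # (assumes no cluster ends up empty, which holds for this toy data).
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):   # stopping criterion
            break
        centers = new_centers
    return centers, labels

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
centers, labels = k_means(X, k=2)
```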

1.2 From Neural Networks to Deep Learning

1.2.1 History of Neural Networks

At the end of the 19th century C. Golgi discovered the black reaction, which also became famous as the Golgi staining method, enabling the visualization of neurons under the light microscope [43]. Using this technique, Ramón y Cajal began his studies of the nervous system [61]. In the meanwhile, Sigmund Freud postulated the neuron theory [44], which was further expanded in the following years by C. S. Sherrington, who formalized this concept by suggesting that neural cells form a network and can communicate with one another through pathways along it [63]. These facts, together with Shannon's theory of information, laid the basis for the models proposed by McCulloch and Pitts in 1943 [41]. Indeed, they recognized that neurons, through their "all-or-none" activation, can accomplish the logic operations AND, OR and NOT, and that networks of neurons can define more complicated logic or arithmetic functions. Some years later Donald Hebb proposed that learning and experience are carried out in terms of synaptic changes, as he stated:

"When an axon of cell A is near enough to excite a cell B and repeatedly or

persistently takes part in firing it, some growth process or metabolic change

takes place in one or both cells such that A’s efficiency, as one of the cells

firing B, is increased.".

Based on this rule and the previous work of McCulloch and Pitts, Marvin Minsky built the first neural computer, "The SNARC" [15], which, however, did not work as expected. It was instead the Mark 1 Perceptron [24] that was the first working neural computer, which modeled the first Artificial Neural Network architecture, composed of one single layer of neurons [35]. The Perceptron is an exemplification of the influence the neuronal theory played during these years. Its architecture clearly resembles that of a nervous cell, which receives impulses from other cells through dendrites and transmits them along the axon with an "all-or-none" response. In the Perceptron, this is modeled through the Heaviside step function and a weighted sum of the inputs. The Heaviside step function is a discontinuous function, which is equal to one for all x greater than zero and zero otherwise. In this model, it plays the role of a switch or activation of the neuronal cell, transmitting a signal when the input, in this case the weighted sum, surpasses a certain threshold [35]. In this model, the learning process was carried out by adjusting the weights of the "network", which, in the Mark 1 Perceptron, were represented by potentiometers connected to an array of 400 photocells transmitting the input and updated through electric motors [5]. A further success was the ADALINE (ADAptive LInear NEuron), which was invented by Widrow and Hoff, who used memistors (resistors with memory) to

Figure 1.2: Structure of the Perceptron model proposed by F. Rosenblatt in 1957-1958. y is the result of a weighted sum of the inputs, y = f(\sum_{j=1}^{n} x_j w_j), where the x_j are the input values and the w_j the weights, followed by a nonlinear transformation f, the Heaviside step function: f = 1 if \sum_{j=1}^{n} x_j w_j > 0, and 0 otherwise.

realize an adaptive neuron for pattern classification [57], [58]. The model was similar to the Perceptron, but it received and transmitted signals as plus or minus one. It also included a fixed input equal to one regulated by an extra weight [57]. This was the first commercially used neural network device, which after 1960 was used in the majority of analog telephones as an echo filter [35]. The excitement for these discoveries and breakthroughs had a significant impact, and optimism and interest were growing, enhanced by some scientists' far-too-optimistic statements, which, however, played a key role in attracting public and private funding [35]. This first hype came to an end in 1969, when Marvin Minsky and Seymour Papert published the book "Perceptrons", in which they mathematically analyzed the weaknesses of the Perceptron and similar approaches, proving in a rigorous manner that these methods were not able to solve trivial problems like the XOR, where data are not linearly separable [36]. As a consequence, the interest and the funding for this field started diminishing. This field, which until

that point had been seen with enthusiasm and optimism, entered its first winter, considered to be a dead-end [35]. Despite the little funding, this field survived the winter, which lasted until the 1980s. During these years, most of the research on neural networks was carried out in signal processing, biological modelling and pattern recognition, as can be observed in the work of T. Kohonen, who in 1972 suggested a linear model for associative memory, proposing its use for pattern classification, and at the beginning of the 1980s described Self-Organizing Maps (SOM) [36]. In the same years Werbos, who had been inspired by Freud's psychological theories and a previous paper of Minsky about the use of reinforcement learning to address general-purpose AI problems, had the intuition of using back-propagation to train neural networks, solving in this way the problem of Rosenblatt's Perceptron model encountered, ironically, by Minsky and Papert. He proved that, with the use of a differentiable function and the chain rule, he could train a multi-layer Perceptron, enabling it to solve non-linear problems such as the XOR. However, the effect of the publication of "Perceptrons" was still too strong and no one considered the idea of publishing these discoveries until 1986, when "Learning representations by back-propagating errors" was published [47], [36]. As Werbos recounted: "In the early 1970s, I did in fact visit Minsky at MIT. I proposed that we do a joint paper showing that MLPs can in fact overcome the earlier problems ... But Minsky was not interested. In fact, no one at MIT or Harvard or any place I could find was interested at the time" [56].

After the publication of Rumelhart's paper, and thanks to the efforts of Hopfield, who managed to attract the interest of new researchers to the field, the field of Neural Networks began its second golden age, which officially started with the first conference on neural networks and the foundation of the INNS (International Neural Network Society) in 1987.

Table 1.2: Timeline of the history of Neural Networks

1943 • McCulloch and Pitts: the neuron as a logic-operation approximator
1949 • Hebb's rules
1951 • Marvin Minsky builds the SNARC
1957 • Frank Rosenblatt builds the Mark 1 Perceptron
1960 • Bernard Widrow and Marcian E. Hoff develop the ADALINE
1969 • Marvin Minsky and Seymour Papert publish "Perceptrons"
1972 • Teuvo Kohonen: linear associator model of associative memory
1973 • Christoph von der Malsburg: nonlinear neuron
1974 • Paul Werbos: backpropagation
1976 • Stephen Grossberg and Gail Carpenter: adaptive resonance theory
1982 • Teuvo Kohonen: Self-Organizing Maps (SOM)
1983 • Fukushima, Miyake, Ito: Neocognitron
1983 • John Hopfield: Hopfield network
1985 • John Hopfield: Hopfield nets for the solution of the travelling salesman problem
1986 • David Rumelhart, Geoffrey Hinton and Ronald Williams rediscover the back-propagation algorithm and publish "Learning representations by back-propagating errors"
1986 • Rumelhart and McClelland: publication of the "PDP book"
1987 • IEEE: first open conference on neural networks
1987 • Foundation of the INNS International Neural Network Society
1988 • Foundation of the INNS journal Neural Networks
1989 • Foundation of Neural Computation
1989 • Hornik, Stinchcombe and White: multilayer Perceptrons are universal approximators
1989 • LeCun: handwritten digit recognition
1990 • Foundation of the IEEE Transactions on Neural Networks

1.2.2 Artificial Neural Networks

Artificial Neural Networks (ANNs) are composite parametric non-linear functions, which were inspired by the neuronal theory and specifically by the neuron structure [5]. The fundamental part of each network is the "neuron", which is also called a unit. An ANN has input, hidden and output units. They are connected by edges, whose strength is defined by weights. Each hidden and output unit is defined as the weighted sum of its inputs followed by a nonlinear function called the activation function [5]. Examples of activation functions are the logistic function, the Heaviside step function and the tanh function, which were the first ones used.

y = \frac{1}{1 + e^{-x}}   (1.3)

y = \begin{cases} 1 & \text{if } \sum_{j=1}^{n} x_j w_j > 0 \\ 0 & \text{otherwise} \end{cases}   (1.4)

Figure 1.3: Equation (1.3) represents the logistic (sigmoid) function. Equation (1.4) represents the Heaviside step function.

Units can be grouped in layers, which are composed of those units having the same distance from the inputs. An artificial neural network is said to be fully connected when all the units in a layer are connected to all the units in the successive one. Following the nomenclature proposed in [5], which suggests counting only the trainable layers of a neural network, artificial neural networks are said to be shallow or deep when they have, respectively, one or more than one hidden layer. ANNs can be represented as directed acyclic graphs (i.e. feed forward neural networks) or as directed cyclic graphs (i.e. recurrent neural networks).

The training of a neural network can therefore be considered a two-step process: a first propagation of the information from the input layer to the output layer through the hidden ones, which is called the forward pass, and the calculation of the prediction error between the outputs and the targets followed by its back-propagation to the inputs, which is called the backward pass. The back-propagation of the error allows the assessment of the contribution of each neuron to the overall error, thus defining an index of the magnitude by which the weights need to be tweaked to reduce it [5]. Until the discovery of this method by P. Werbos, the training of a neural network

y = f_{act}(b + \mathbf{w}^T \mathbf{x})   (1.5)

Figure 1.4: Equation (1.5) represents the computation of the value of an artificial neuron; f_{act} stands for the activation function.

with multiple layers was thought to be impossible, and so was the training of a Perceptron able to solve the XOR problem, which, for this purpose, would have needed a hidden layer and a differentiable activation function, as affirmed by K. Hornik, who published in 1990 an article in which he proved that a feed forward neural network can approximate any continuous function arbitrarily well if the activation function is continuous, bounded and non-constant and the hidden layer has enough hidden units [29].
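To make equation (1.5), and its layer-wise form in equation (1.6) below, concrete, a minimal NumPy sketch of a forward pass through a small fully connected network follows (layer sizes, random weights and the sigmoid activation are illustrative, not the architectures used later in this thesis):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # logistic activation, equation (1.3)

def forward(x, weights, biases):
    """Forward pass: each layer computes f_act((W^l)^T a^{l-1} + b^l)."""
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W.T @ a + b)
    return a

rng = np.random.default_rng(0)
# A toy network with 3 inputs, one hidden layer of 4 units and 1 output unit.
weights = [rng.normal(size=(3, 4)), rng.normal(size=(4, 1))]
biases = [np.zeros(4), np.zeros(1)]

x = np.array([0.5, -1.2, 3.0])
print(forward(x, weights, biases))
```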

Training of an Artificial Neural Network

In this part a more mathematical explanation of the training of a neural

network is given.

Let’s define a data-set D to be:


D = \{(\mathbf{x}^{(i)}, y^{(i)})\}   (1.7)

where \mathbf{x}^{(i)} \in \mathbb{R}^n represents a sample of the data-set D having n features and y^{(i)} defines the target of sample i.

\mathbf{y}^{l} = f_{act}((\mathbf{W}^{l})^T \mathbf{a}^{l-1} + \mathbf{b}^{l})   (1.6)

Figure 1.5: Equation (1.6) shows how all the values of the units in one layer are calculated. Here W^l \in \mathbb{R}^{n \times m}, where n is the number of inputs and m the number of units in the hidden layer, a^{l-1} \in \mathbb{R}^{n} is the activation vector of the previous layer (or the input vector) and l is the index of the layer. Each value y is the result of a weighted sum of the activations of the previous layer followed by a nonlinear transformation.

Let’s define a Loss function as:


L = L(y^{(i)}, g(\mathbf{x}^{(i)}; \mathbf{w}))   (1.8)

where g(\mathbf{x}^{(i)}; \mathbf{w}) is an ANN parametrized by the weights w. Many different loss functions have been used in the ANN literature. Among these, the Mean Squared Error (MSE), the Mean Absolute Error (MAE), the Kullback-Leibler divergence (KL) and the cross-entropy are some of the most commonly used ones. They are defined as follows, where \hat{y}^{(i)} denotes the prediction of the network for sample i:

L_{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y^{(i)} - \hat{y}^{(i)})^2   (1.9)

L_{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y^{(i)} - \hat{y}^{(i)}|   (1.10)

L_{KL} = \frac{1}{n} \sum_{i=1}^{n} D_{KL}(y^{(i)} \,\|\, \hat{y}^{(i)}) = \frac{1}{n} \sum_{i=1}^{n} y^{(i)} \log\frac{y^{(i)}}{\hat{y}^{(i)}} = \underbrace{\frac{1}{n} \sum_{i=1}^{n} y^{(i)} \log(y^{(i)})}_{\text{entropy term}} - \underbrace{\frac{1}{n} \sum_{i=1}^{n} y^{(i)} \log(\hat{y}^{(i)})}_{\text{cross-entropy term}}   (1.11)

L_{cross-entropy} = -\frac{1}{n} \sum_{i=1}^{n} y^{(i)} \log(\hat{y}^{(i)})   (1.12)
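A small NumPy sketch of equations (1.9), (1.10) and (1.12) follows (the function names are illustrative; targets and predictions are assumed to be arrays of the same shape, and the cross-entropy assumes probability vectors):

```python
import numpy as np

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)      # equation (1.9)

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))     # equation (1.10)

def cross_entropy(y, y_hat, eps=1e-12):
    # equation (1.12), summed over classes and averaged over samples;
    # eps avoids log(0) for very confident predictions
    return -np.mean(np.sum(y * np.log(y_hat + eps), axis=-1))

y     = np.array([[1.0, 0.0], [0.0, 1.0]])   # one-hot targets
y_hat = np.array([[0.9, 0.1], [0.2, 0.8]])   # predicted probabilities
print(mse(y, y_hat), mae(y, y_hat), cross_entropy(y, y_hat))
```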

The importance of a loss function in the training of an ANN lies in the fact that it measures the error between the values predicted by the ANN (\hat{y}) and the targets (y); its value is inversely related to the quality of the model, and its choice influences the way a model takes errors into account. This can easily be seen by comparing the MSE and the MAE: in the first case large errors have a much greater impact due to the square, while the second one is more robust against outliers. This error is generally

called the empirical error and it is an evaluation of the model based on data which may not fully represent the real data distribution. Therefore, a model trained on these data could incur over-fitting or under-fitting problems, where the first one refers to a model able to fit the training data well but unable to generalize to unseen data, and the second one refers to a too simple model that cannot capture the complexity of the data [5], [31]. Instead of minimizing the empirical error, the aim of a model should be the minimization of the generalization error, defined as the expected loss on future data. One approach to obtain an estimation of this error is the test set method. This approach consists in splitting the data-set into training and validation sets, using the first one to train the model and the second one to evaluate it. To have an unbiased estimation of the error of the model, a third set needs to be selected, which is only used after the model selection to evaluate the quality of the final selected model. These concepts are especially important when using highly expressive models like ANNs, which can easily over-fit [17], [5].

Back-propagation

In practice, ANNs are trained using the training set and evaluated on a test set, which provides an estimation of the generalization error. On the one hand, such an estimation gets better when the number of test samples increases; on the other hand, the goodness of the model increases when the number of training samples increases [5]. Therefore, a trade-off between the number of training samples and the number of test samples is needed. The training process is carried out using the back-propagation algorithm shown in Algorithm 1. As can be seen in the description, this algorithm

consists of two procedures carried out for each sample and target in the data-set, which are the Forward-Pass and the Backward-Pass. In the Forward-Pass, for each sample of the data-set, the input layer of the ANN is initialized with the values of its features, then the inputs are propagated through the network:

\mathbf{y}^{(0)} = \mathbf{x}^{(i)}   (1.13)

net^{l} = (\mathbf{W}^{l})^T \mathbf{y}^{(l-1)} + \mathbf{b}^{(l)}   (1.14)

\mathbf{y}^{(l)} = f_{act}(net^{l})   (1.15)

Using a previously selected loss function, such as those described in equations (1.9)-(1.12), the error between the value predicted by the ANN and the real one is calculated.

Algorithm 1 Back-Propagation
 1: procedure BACK-PROPAGATION(D, L, ANN)
 2:   for each (x_i, y_i) ∈ D do                         ▷ Forward Pass
 3:     y^0 := x_i                                       ▷ Initialization of the input layer
 4:     for each l ∈ (1, ..., L) do
 5:       y^l := f_act((W^l)^T y^{l-1} + b^l)            ▷ Propagation of the information
 6:     end for
 7:                                                      ▷ Backward Pass
 8:     δ^L := ∂L(y_i, y^L) / ∂net_out                   ▷ Calculation of the output error
 9:     for each l ∈ (L-1, ..., 1) do
10:       δ^l := (W^{l+1} δ^{l+1}) ⊙ f'_l(net^l)          ▷ Back-propagation of the deltas
11:       W^l := W^l − η δ^l (y^{l-1})^T                  ▷ Update of the weights
12:       b^l := b^l − η δ^l                              ▷ Biases update
13:     end for
14:   end for
15: end procedure

To assess the proportion in which each weight participated in the error, the gradient of the loss function with respect to a specific weight is calculated:

\nabla_{w_{ij}} = \frac{\partial L}{\partial w_{ij}} = \frac{\partial L}{\partial net_j} \frac{\partial net_j}{\partial w_{ij}}   (1.16)

where net_j in this case is equal to the linear combination of the weights connecting the previous layer to unit j and the values of the units in the previous layer, (\mathbf{w}_j)^T \mathbf{y}^i. The term \partial L / \partial net_j is called the delta and denoted as \delta_j. It represents the error that is back-propagated to the previous layer. The deltas for the output layer are easily calculated in the case of the halved quadratic loss as:

\delta_{out} = \frac{\partial}{\partial net_{out}} \left( \frac{1}{2} (y^{(i)} - f(net_{out}))^2 \right) = (f(net_{out}) - y^{(i)}) \, f'(net_{out})   (1.17)

and:

\frac{\partial net_{out}}{\partial w_{i\,out}} = \frac{\partial}{\partial w_{i\,out}} (\mathbf{w}_{out})^T \mathbf{y}^i = y^i_i   (1.18)

so:

\nabla_{w_{i\,out}} = \delta_{out} \, y^i_i   (1.19)

and the calculation for the weights in a hidden layer becomes:

Figure 1.6: The deltas propagate back through the network as a weighted sum of the deltas in the output layer times the derivative of the activation function (blue arrow). The weights in the hidden layer are calculated considering the deltas (blue arrow) and the activations in the ancestor layer (red arrows).

\frac{\partial L}{\partial \mathbf{W}^{l-1}} = \delta^{l-1} (\mathbf{y}^{l-2})^T   (1.20)

\delta^{l-1} = f'(net^{l-1}) \odot (\mathbf{w}_{out})^T \delta_{out}   (1.21)

\frac{\partial L}{\partial \mathbf{W}^{l-1}} = \left( f'(net^{l-1}) \odot (\mathbf{w}_{out})^T \delta_{out} \right) (\mathbf{y}^{l-2})^T   (1.22)

From the last formula it is possible to understand that the deltas are propagated back and that the weights of a hidden layer are calculated taking into account the activations of the ancestor layer and the errors coming from the successor one.
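A compact NumPy sketch of Algorithm 1 for a one-hidden-layer network with sigmoid activations and the halved quadratic loss is given below (a toy illustration of equations (1.16)-(1.22), not the implementation used in this thesis):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 3)), np.zeros(3)   # input -> hidden
W2, b2 = rng.normal(size=(3, 1)), np.zeros(1)   # hidden -> output
eta = 0.1                                        # learning rate

x, y = np.array([0.0, 1.0]), np.array([1.0])     # one toy training sample

for _ in range(1000):
    # Forward pass (equations 1.13-1.15)
    y1 = sigmoid(W1.T @ x + b1)
    y2 = sigmoid(W2.T @ y1 + b2)
    # Backward pass: deltas (equations 1.17 and 1.21)
    d2 = (y2 - y) * y2 * (1 - y2)                # output delta, halved quadratic loss
    d1 = (W2 @ d2) * y1 * (1 - y1)               # hidden delta
    # Gradient-descent updates (lines 11-12 of Algorithm 1)
    W2 -= eta * np.outer(y1, d2)
    b2 -= eta * d2
    W1 -= eta * np.outer(x, d1)
    b1 -= eta * d1
```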

Gradient descent based methods

The update rule used in Algorithm 1 to tweak the parameters w of the network is called gradient descent, and it is a general optimization technique used to find a solution to a problem by minimizing a cost function. It looks at the gradient of the cost function with respect to each weight and updates that weight in the direction of the negative gradient, which represents a step downhill in the direction of the steepest slope [5], [17]. The length of the step is defined by another parameter called the learning rate, denoted as η in Algorithm 1. It is an important hyper-parameter which must be taken into account during the training of an ANN, because it defines the number of iterations the algorithm needs to approach the minimum (the smaller the step, the more steps are needed and therefore more time is needed too), and it can lead to overstepping the minimum or jumping around it in case it is set to a too-large value, especially in the case of functions with many local minima [17]. During the past years, improvements to this algorithm were proposed to achieve better and faster convergence; examples of these improvements are Stochastic Gradient Descent (SGD), the RMSProp algorithm [26] and the Adam one [17]. The first one takes only a random sample at a time to calculate the gradient and updates the weights according to it. In this way it avoids using all the data for each update, speeding the process up and reducing the memory usage [17]. The second one uses the idea of rprop of dividing the gradient by its magnitude and applies it to mini-batches. It does so by keeping a moving average of the squared gradient and dividing the gradient by it during the calculation of the update of a

mini-batch, as shown in equations (1.25) and (1.26) [17].

\theta := \theta - \eta \nabla_\theta J(\theta) = \theta - \eta \frac{1}{n} \sum_{i=1}^{n} \nabla_\theta J_i(\theta)   (1.23)

\theta := \theta - \eta \nabla_\theta J_i(\theta)   (1.24)

\mathbf{s} := \beta \mathbf{s} + (1 - \beta) \nabla_\theta J(\theta) \otimes \nabla_\theta J(\theta)   (1.25)

\theta := \theta - \eta \, \nabla_\theta J(\theta) \oslash \sqrt{\mathbf{s} + \epsilon}   (1.26)

where J_i denotes the loss on a single sample i, and ⊗ and ⊘ denote element-wise multiplication and division.
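A small NumPy sketch of the SGD and RMSProp updates of equations (1.24)-(1.26) follows (the function names and hyper-parameter values such as `beta` and `eps` are illustrative):

```python
import numpy as np

def sgd_step(theta, grad, eta=0.01):
    """Plain (stochastic) gradient descent, equation (1.24)."""
    return theta - eta * grad

def rmsprop_step(theta, grad, s, eta=0.001, beta=0.9, eps=1e-8):
    """RMSProp update, equations (1.25) and (1.26)."""
    s = beta * s + (1 - beta) * grad * grad   # moving average of the squared gradient
    theta = theta - eta * grad / np.sqrt(s + eps)
    return theta, s

theta = np.zeros(3)
s = np.zeros(3)
grad = np.array([0.5, -0.2, 0.1])             # gradient from one mini-batch
theta, s = rmsprop_step(theta, grad, s)
```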

Deep Learning

Deep learning is a term used to define the use of deep artificial neural networks (DNNs) to accomplish a specific task. DNNs are basically ANNs with more than one hidden layer. In 1991 and 1989, Hornik and Cybenko respectively reached the conclusion and proved that an ANN can approximate any continuous function, provided that there are enough hidden units in the hidden layer [29]. However, the proof did not define how many units are enough, which could be an extremely large number, as suggested in [31] for the case of ANNs with one hidden layer and proved by Eldan and Shamir in [14] for 2-hidden-layer networks. In their work, they demonstrated that there is a function approximated with a 3-layer ANN which cannot be approximated with a 2-layer one unless the number of its hidden units is exponential in the dimension, suggesting that the use of deep networks can achieve the same or better results using more layers with fewer units. Another problem concerns the way to find the function that approximates the one of interest. Indeed, the training may fail, choosing another function because of over-fitting, or the optimization method used may hinder the selection of the right parameters needed to approximate it [31]. Instead, DNNs were proved to generalize better in many works, and this, together with the fact that exponentially fewer neurons per layer are needed, makes DNNs interesting.

The learning process in DNNs can be interpreted as a hierarchical process in which, in each hidden layer, from the first one to the last one, more and more abstract features are learned. This can be especially well observed in DNNs applied to image tasks. Indeed, in the case of facial images, in the first hidden layers features like edges and corners are learned, while in the successive ones details like noses, eyes, etc. can be appreciated. The use of DNNs to capture complex patterns had already been hypothesized during the 80s, but the computational power was not sufficient, and neither was the amount of available data. Another problem encountered was related to the lack of a method able to train deep networks. Indeed, the gradient used to calculate the weight updates of DNNs was observed to vanish or explode when back-propagating the error through the layers, making the training of DNNs with many layers impossible. These problems were often observed and caused a loss of interest in DNNs, until the field was rescued by the success of the article published by Hinton and Osindero [27], who suggested that using a good initialization would allow the training of a deep network. Specifically, they used an unsupervised method to initialize the weights of each layer of the network and showed that this method achieved the best performance on the MNIST data-set, which is a standard data-set used to evaluate a model's performance in machine learning. One of the most important breakthroughs was the discovery of the role played by the sigmoid activation function in the vanishing of the gradient. Other important discoveries were the importance of the layers' initialization, described by Glorot and Bengio in [18], and the Rectified Linear Unit as an alternative activation function. Indeed, this activation function does not saturate as the sigmoid one does and is more easily computable, allowing in this way not only a better backward flow of the gradient, but also a speeding up of the process. Results obtained using this new non-linearity and a better initialization were even better than those obtained using unsupervised training of each layer. This provided an end-to-end learning system able to extract more and more complex features along the layers, mapping input to output directly and thus minimizing hand-crafting [31], [36]. Deep learning methods achieved state-of-the-art results in many different fields like speech recognition, machine translation, text generation and music tracking, bringing a new wave of interest to the field of neural networks.

Figure 1.7: Figure (a) represents a simple ANN (one-to-one), while the others represent different configurations of RNNs unfolded in time, mapping a vector to a sequence (one-to-many), a sequence to a vector (many-to-one) and a sequence to a sequence (many-to-many). Figure source: [34]

Until this moment only feed forward neural networks were considered, but there are other architectures of deep neural networks which were highly successful in different tasks, like convolutional neural networks for image classification and generation, auto-encoders for image denoising or data compression, and recursive neural networks for structured data. Recurrent neural networks became more and more important during the last years for the analysis of sequences or time-dependent data. They can be considered a special case of recursive neural networks applied to unary trees, where each member of the unary tree is obtained by the application of a neural network on the predecessor node. In the case of a sequence represented as {x_0, ..., x_{t-1}, x_t, x_{t+1}, ..., x_n}, where t is the position of an element in the sequence and n denotes the last one, the network can be defined as:

\mathbf{a}^{(t)} = \mathbf{b} + \mathbf{W}\mathbf{h}^{(t-1)} + \mathbf{U}\mathbf{x}^{(t)}   (1.27)

\mathbf{h}^{(t)} = f_{act}(\mathbf{a}^{(t)})   (1.28)

\mathbf{o}^{(t)} = \mathbf{c} + \mathbf{V}\mathbf{h}^{(t)}   (1.29)

\mathbf{y}^{(t)} = f_{act}(\mathbf{o}^{(t)})   (1.30)

where U, W and V are matrices representing the weights between the input-hidden, hidden-hidden and hidden-output layers. The loss of the model in this case is given by the sum of the losses along all time steps. In these formulas it can be seen that the calculation of the hidden layer takes into account the hidden layer at the preceding time step and the input at the current one, and that the parameters W, U and V are shared. With this in mind, an RNN can be seen as a deep neural network where each layer shares the parameters and which, once unfolded in time, can be represented as a directed acyclic graph (DAG). For this reason, RNNs are trained using a method similar to back-propagation called back-propagation-through-time (BPTT) and can incur the same problems of vanishing and exploding gradients. Therefore, long dependencies are hard for the network to learn. A solution to this

problem was achieved with the invention of the Long Short-Term Memory (LSTM) model proposed by Hochreiter and Schmidhuber in [28]. The LSTM has a modular structure similar to RNNs, with incoming inputs from the previous LSTM cell and from the current time step, and outgoing information flowing to the next LSTM cell or being used to predict the output at that time step; but in comparison with a normal RNN, LSTMs offer a way to regulate this flow of information through gates defining what has to be "remembered" and what has to be "forgotten". The structure of an LSTM can be mathematically defined using the formulas from [31, 410-411]:

Figure 1.8: LSTM cell structure: the σ represents the sigmoid functions used to build the forget, state and output gates; x and + represent element-wise multiplication and addition, respectively. Figure source: [42]

f_i^{(t)} = \sigma\left( b_i^f + \sum_j U_{i,j}^f x_j^{(t)} + \sum_j W_{i,j}^f h_j^{(t-1)} \right)   (1.31)

s_i^{(t)} = f_i^{(t)} s_i^{(t-1)} + g_i^{(t)} \, \sigma\left( b_i + \sum_j U_{i,j} x_j^{(t)} + \sum_j W_{i,j} h_j^{(t-1)} \right)   (1.32)

g_i^{(t)} = \sigma\left( b_i^g + \sum_j U_{i,j}^g x_j^{(t)} + \sum_j W_{i,j}^g h_j^{(t-1)} \right)   (1.33)

h_i^{(t)} = \tanh(s_i^{(t)}) \, q_i^{(t)}   (1.34)

q_i^{(t)} = \sigma\left( b_i^o + \sum_j U_{i,j}^o x_j^{(t)} + \sum_j W_{i,j}^o h_j^{(t-1)} \right)   (1.35)

The internal structure of an LSTM involves four neural network layers, which are connected to one another in a special way to allow the cell to remember or forget things. Specifically, there is a forget gate, represented by the first σ in Figure 1.8 and calculated through equation (1.31), a state gate, represented by the second one and calculated by equation (1.32), and an output gate, represented by the third one and calculated by equation (1.35). The first gate is responsible for the removal of unwanted information, the second one is responsible for the addition of information to the cell state and the last one defines the output [31]. These gates are neural network layers and use a sigmoid function, whose outputs are in the range zero to one, and an element-wise multiplication or addition to define the part of the information that has to be forgotten or added. Thanks to this element-wise addition, the cell does not show the vanishing of the gradient, avoiding one of the major problems which hindered the learning of long dependencies with RNNs.
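As an illustration of equations (1.31)-(1.35), a single LSTM cell step can be sketched in NumPy as follows (weight shapes, names and the toy sequence are illustrative; real implementations such as Keras' LSTM layer handle whole sequences and batches):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, s_prev, p):
    """One LSTM cell step, following equations (1.31)-(1.35).
    p is a dict of weight matrices U*, W* and bias vectors b*."""
    f = sigmoid(p["bf"] + p["Uf"] @ x_t + p["Wf"] @ h_prev)   # forget gate (1.31)
    g = sigmoid(p["bg"] + p["Ug"] @ x_t + p["Wg"] @ h_prev)   # input/state gate (1.33)
    q = sigmoid(p["bo"] + p["Uo"] @ x_t + p["Wo"] @ h_prev)   # output gate (1.35)
    s = f * s_prev + g * sigmoid(p["b"] + p["U"] @ x_t + p["W"] @ h_prev)  # cell state (1.32)
    h = np.tanh(s) * q                                        # hidden state (1.34)
    return h, s

rng = np.random.default_rng(0)
n_in, n_hid = 4, 8
p = {k: rng.normal(scale=0.1, size=(n_hid, n_in)) for k in ["Uf", "Ug", "Uo", "U"]}
p.update({k: rng.normal(scale=0.1, size=(n_hid, n_hid)) for k in ["Wf", "Wg", "Wo", "W"]})
p.update({k: np.zeros(n_hid) for k in ["bf", "bg", "bo", "b"]})

h, s = np.zeros(n_hid), np.zeros(n_hid)
for x_t in rng.normal(size=(5, n_in)):   # a toy sequence of length 5
    h, s = lstm_step(x_t, h, s, p)
```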

Figure 1.9: Examples of faces generated with a Deep Convolutional Generative Adversarial Network. Figure source: [12]

1.3 Generative Adversarial Network (GAN)

Since 2006 the interest in deep learning methods has increased more and more, enhanced by the numerous successes achieved by these models. At the same time, deep learning methods themselves evolved impressively fast. New architectures to tackle different tasks, more powerful training algorithms and optimization techniques were proposed and applied to many tasks spanning from unsupervised to supervised and reinforcement learning. In the previous chapters, the Perceptron and the ADALINE networks were introduced, and both represent examples of discriminative models, which try to learn a conditional distribution, where a model tries to infer the label of a sample given its features. While for some tasks the manipulation of a conditional distribution can be of interest, for others, being able to

manipulate the full joint distribution is desirable. In this chapter Generative Adversarial Networks (GANs) will be introduced. Generative models are models that try to learn the distribution from which the data in the data-set were sampled. Here, this distribution is called the data generating distribution. According to I. Goodfellow's classification schema proposed in [20], generative models can be divided into explicit density models, which are those trained using the full joint distribution directly, and implicit density models, which are those trained using the data generating distribution indirectly by sampling from it. Examples of generative models are variational auto-encoders and Boltzmann machines, which are explicit density models, and generative adversarial networks, which are part of the implicit density models. Generative models became of interest in those fields where the generation of new samples is required, for example in simulations, when new scenarios must be generated, or coupled to reinforcement learning to generate new environments for the agents; they could also be used to give an RL agent some kind of imagination, where the generative model is used to generate new virtual environments in which it can carry out actions and hypothesize consequences. In the past years, generative models were successfully used for the improvement of image resolution [38], the generation of new images from sketches [32], and for music [33] and text generation [30]. Generative adversarial networks were proposed for the first time by Ian Goodfellow and colleagues in 2014 [21] and during the last years their popularity exploded. A GAN is a model trained using an adversarial process in which two actors, a generator G and a discriminator D, compete with each other. While the task of the generator is to produce samples which look real, the one of the discriminator is to understand whether the generated samples are real or not.

Algorithm 2 Train GAN
 1: procedure TRAIN GAN(D, G, Data-set)
 2:   for number of training iterations do
 3:     for k steps do
 4:       Sample m noise samples from p_g(z)
 5:       Sample m samples from p_data(x)
 6:       Update D by ascending its stochastic gradient:
            \nabla_{\theta_d} \frac{1}{m} \sum_{i=1}^{m} \left[ \log D(x^{(i)}) + \log(1 - D(G(z^{(i)}))) \right]
 7:     end for
 8:     Sample m noise samples from p_g(z)
 9:     Update G by descending its stochastic gradient:
            \nabla_{\theta_g} \frac{1}{m} \sum_{i=1}^{m} \log(1 - D(G(z^{(i)})))
10:   end for
11: end procedure

During this process the generator receives feedback about the quality of the generated samples from the discriminator and improves itself to produce samples of better quality, and the discriminator becomes better at understanding the origin of the samples. Both D and G can be any differentiable function, and a general choice is to use artificial neural networks. They can have different architectures depending on the type of samples one wants to generate. The task of the generator is to map Gaussian noise to the sample space and feed the discriminator with it. The discriminator receives both data coming from the generator and from the original data-set. When the data come from the data-set, the discriminator tries to output 1, while, when they are generated samples, the generator will push the discriminator to output 1 and the discriminator will try to output 0. This process is often compared to a game because at the same time the discriminator is trying to minimize J^{(D)}(θ^{(D)}, θ^{(G)}) while only being allowed to tweak the parameters θ^{(D)}, whereas the generator tries to minimize J^{(G)}(θ^{(D)}, θ^{(G)}) but is allowed to tweak only the parameters θ^{(G)}. This process is described in Algorithm 2, which is the original algorithm proposed by

Goodfellow and coworkers in [21]. In this article, the convergence of this algorithm and the equality between the generator's distribution and the data generating distribution when the global minimum is reached were proved. However, in practice this approach does not work well, due to the feedback passed from the discriminator to the generator in the form of a gradient, which becomes smaller and smaller as the confidence of the discriminator increases, as proved in [4]. For this reason, a different objective to train the generator, aiming to solve this problem, was defined as:

J^{(G)} = -\frac{1}{2} \, \mathbb{E}_z \log D(G(z))   (1.36)

This different objective should guarantee a better flow of the gradient from the discriminator to the generator, but further theoretical analysis carried out in [4] showed that the price paid for a better flow of the gradient is the instability that may emerge during the training. Since the first article about GANs, new models aiming at improving the quality of generated samples and stabilizing the training process were proposed. DCGAN, Wasserstein GAN and InfoGAN are some examples of these improved implementations. The popularity they achieved, thanks to the quality of the produced images compared to other generative models, gives hope for the use of this method for the generation of data required in other fields like Chemistry and pharmacology.
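To make the adversarial training procedure of Algorithm 2 concrete, a minimal Keras sketch of a GAN on vector data is given below (layer sizes, the placeholder random data and the hyper-parameters are illustrative only, not the Chemo-GAN architecture described in the Methods chapter):

```python
import numpy as np
from tensorflow import keras

latent_dim, data_dim = 64, 1024   # noise size and, e.g., fingerprint length (illustrative)

generator = keras.Sequential([
    keras.layers.Dense(256, activation="relu", input_dim=latent_dim),
    keras.layers.Dense(data_dim, activation="sigmoid"),
])
discriminator = keras.Sequential([
    keras.layers.Dense(256, activation="relu", input_dim=data_dim),
    keras.layers.Dense(1, activation="sigmoid"),
])
discriminator.compile(optimizer="adam", loss="binary_crossentropy")

# Combined model used to train the generator: freeze D so gradients only update G.
discriminator.trainable = False
gan = keras.Sequential([generator, discriminator])
gan.compile(optimizer="adam", loss="binary_crossentropy")

real_data = np.random.randint(0, 2, size=(5000, data_dim)).astype("float32")  # placeholder data
batch = 128
for step in range(1000):
    # Discriminator step: real samples labelled 1, generated samples labelled 0.
    z = np.random.normal(size=(batch, latent_dim))
    fake = generator.predict(z, verbose=0)
    real = real_data[np.random.randint(0, len(real_data), batch)]
    discriminator.train_on_batch(np.vstack([real, fake]),
                                 np.concatenate([np.ones(batch), np.zeros(batch)]))
    # Generator step: try to make the discriminator output 1 for generated samples.
    z = np.random.normal(size=(batch, latent_dim))
    gan.train_on_batch(z, np.ones(batch))
```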

Figure 1.10: Structure of a generative adversarial network. Vectors sampled from a Gaussian distribution with zero mean and variance equal to 1 are fed into the generator, generating fake samples. Fake samples and real ones are fed into the discriminator, whose task is to discriminate whether samples are real or not.

1.4 Machine Learning in chemoinformatics

In the past decades the amount of data produced in the field of Chemistry has undergone an exponential increase. Breakthroughs in different fields, spanning from array-based technologies to liquid-handling ones and robotics, allowed the miniaturization of common procedures, which were generally carried out by operators. This improvement made it possible to overcome the throughput of past technologies, making them compatible with ultra-High Throughput Screening (uHTS) methods and opening this field to computational methods and machine learning techniques [37]. Chemoinformatics and cheminformatics are the most common names used to refer to the application of these methods to chemical data [37]. General data used in

chemoinformatic work-flows involve SDF files (Structure Data Format), WLN (Wiswesser Line Notation) or SMILES (Simplified Molecular Input Line Entry Specification), which are representations of 2D or 3D chemical structures. In the past years the SMILES notation has become more and more common due to its simplified rules in comparison with the WLN ones [37]. The use of machine learning methods in chemoinformatics has become especially important since their use for the inference of molecules' activities in bio-assays or of their properties. The first uses of these methods for these purposes, which are currently defined as QSAR (Quantitative Structure-Activity Relationship) and QSPR (Quantitative Structure-Property Relationship) respectively, date back to 1935 [22] and 1964 [23]. Initially, only simple linear regression models on compounds with few descriptors and covering small chemical spaces could be applied, due to the lack of computational power and data, but nowadays these limitations are being overcome and new methods extending the applicability of QSAR and QSPR to nonlinear classification and regression tasks have been implemented and exhaustively studied. Repositories like PubChem, ChEMBL, etc., have been created to allow storage and retrieval of chemical information, playing a key role in the evolution of chemoinformatics. The general work-flow of QSAR and QSPR follows two steps: an encoding step and a mapping one [39].

Activity or Property = f(structure) = M(E(structure))   (1.37)

During the encoding process, molecules are generally encoded into vec-

tors of chemical descriptors, which are "numerical values that characterize

properties of molecules", as defined in [37], calculated from the 2D or 3D


structure and from the mutual orientation and time-dependent dynamics of molecules, which are respectively named 2D, 3D and 4D chemical descriptors [7]. More than 5,000 descriptors have been defined [49], [54]; among them, ClogP, molecular refractivity, topological indices like the Wiener index and 2D fingerprints are some examples, which can be calculated through open-source software like PaDEL [62] or closed-source ones such as DRAGON 7.0 [2]. During the second step, these vectors are mapped to a property or activity class through a function, which is generally what most machine learning methods try to optimize. Other approaches were suggested to directly extract features from molecular structures, reducing problems concerning descriptor definition, computation and feature selection. An example of these methods was presented in [39] by Lusci and colleagues, who used a recursive neural network to encode undirected molecular graphs into vectors, retaining in this way both structural and chemical information and obtaining, with this model, state-of-the-art performance in the prediction of aqueous solubility [39].
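To make the two-step work-flow of Equation 1.37 concrete, the following minimal sketch (not the code used in this thesis) encodes a few hypothetical SMILES strings into Morgan fingerprints with RDKit and maps them to toy activity labels with a scikit-learn classifier; all compounds and labels are placeholders chosen only for illustration.

from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestClassifier

smiles = ["CCO", "c1ccccc1", "CC(=O)O", "CCN"]   # hypothetical compounds
labels = [0, 1, 0, 1]                            # hypothetical activity classes

# Encoding step E(structure): Morgan (ECFP-like) bit vectors
def encode(s):
    mol = Chem.MolFromSmiles(s)
    return list(AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048))

X = [encode(s) for s in smiles]

# Mapping step M(...): any classifier or regressor can play this role
model = RandomForestClassifier(n_estimators=100).fit(X, labels)
print(model.predict([encode("CCCO")]))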

QSAR and QSPR methods are playing a significant role in drug discovery and especially in "de novo" drug design. Indeed, the prediction of molecular properties can be highly valuable for the evaluation of those chemical compounds which may pass all the phases of the drug development life cycle. The solubility of drugs in water is an example of this: it determines how efficiently a chemical compound is absorbed by the body, allowing the early rejection of active compounds that would otherwise be discarded only in later, more expensive stages. Recently, other measures such as drug toxicity have highlighted the usefulness of these models for the rejection of toxic molecules in early phases of drug development, thereby increasing the quality of the selected candidates. In the previous chapters, the success of deep learning


models and generative ones in many different fields was introduced. As a result, many attempts to apply these methods in chemoinformatics were made. The work carried out at the Bioinformatics Institute of the Johannes Kepler University of Linz [40] is an illustrative example of the power of deep learning methods, multitask and ensemble learning in toxicity prediction. During the Tox21 data challenge, this model (DeepTox) achieved the best performance among all computational methods in many assays [40]. The work carried out by Bombarelli and coworkers, instead, is an illustrative example of generative models in cheminformatics: they used an RNN-variational-auto-encoder to encode SMILES strings into a latent chemical space and decode them back, thereby enabling the use of this latent space for the generation of new molecules having specific properties [19]. The use of a 3-stacked-LSTM for the generation of molecules was presented by Segler et al. in their article [50], where they showed the ability of their model to produce both data-sets of general molecules and data-sets enriched in molecules with specific molecular properties, which they used to implement an in-silico "de novo" drug design cycle. Due to the high success of deep learning and generative models in chemoinformatics and that of GANs for image generation, the use of a generative adversarial network for the generation of chemical compounds is proposed.


Figure 1.11: Work-flows for QSAR and QSPR. Data are encoded in chemical descriptors and fingerprints and labeled with activities or properties. These data are then used to train a model, which will learn to predict properties or activities of new chemical compounds.

1.5 Aims of the master thesis

The development of new drugs is a highly expensive and risky process, which involves the identification of a target and the definition of the lead


compound. This is generally achieved by searching libraries of chemical compounds for molecules with the desired characteristics and filtering out those showing undesired ones, such as toxicity and insolubility [37]. This filtering process is computationally expensive and time-consuming, but it can dramatically increase the probability for a selected compound to reach the market. It reduces the number of compounds that have to be tested in the laboratory, the amount of money that has to be invested in this process and the probability that compounds fail in successive steps. However, the success of these screening methods is highly dependent on the chemical compounds present in the searched libraries and on the hand-crafted rules used to screen them [37]. In recent years, the chemical space of synthetically accessible molecules has been estimated to be around 10^60 compounds [19]; unfortunately, even the newest technologies are still far from achieving this throughput and consequently computational methods like QSAR or QSPR are used to score the molecules present in chemical libraries, selecting in this way only the most promising ones for the next steps. As seen before, these methods are highly dependent on the compounds present in chemical libraries, which in turn depend on the hand-crafted rules used at compound generation time. As a consequence, the search is led towards a specific part of the chemical space defined by these rules, leaving molecules lying in other regions overlooked [19]. Of course, new rules can be defined, but the time and effort required are huge. Therefore, new methods able to automatically extract these rules would be extremely desirable because, on the one hand, they would reduce the problem concerning the part of the chemical space analyzed and, on the other hand, they would not require the definition of rules.

In this thesis generative adversarial networks are proposed for the first


time as a possible method aiming at solving these problems, allowing the generation of new chemical compounds without previously defined hand-crafted rules, covering in this way a wider chemical space and providing a new powerful tool for chemoinformatics and drug discovery. With this purpose, two architectures were implemented in Keras, respectively called Chemo-GAN and Latent-Space-GAN. The first one is a generative adversarial network trained on molecular fingerprints, where both generator and discriminator are fully connected artificial neural networks. The second one uses an auto-encoder to map SMILES strings to and from a latent space and uses this latent space representation of SMILES strings to train a generative adversarial network. In this case, the auto-encoder uses LSTM layers to consider relations between each character and those preceding or following it, while the GAN uses fully connected neural networks for both the generator and the discriminator.

For the evaluation of these models a new metric, the Fréchet Tox21 Distance (FTOXD), was defined, which was inspired by the Fréchet Inception Distance (FID) proposed in [25]. This metric provides a way to evaluate the distance between the distribution of molecular fingerprints coming from the original data-set and the one represented by the generator part of the GAN. It offers the possibility to measure the sample quality of generated SMILES strings and molecular fingerprints and to evaluate the capability of a GAN to approximate the original data distribution, taking into account automatically extracted, chemically highly relevant features. For the evaluation of molecular fingerprint similarity the Tanimoto coefficient was also used. The second model was instead evaluated using the FTOXD together with the percentage of valid generated molecules. Both models were trained on data derived


from the ChEMBL data-set, while the model used for the calculation of the FTOXD was trained on the Tox21 data-set. In the next chapter, these methods are explained in detail. In the first sub-chapter the data-sets used are explained together with the work carried out to derive them. Afterwards, the methods for the evaluation of the models are described. Finally, the Chemo-GAN and the Latent-Space-GAN are described.


2. Methods

2.1 Simplified molecular-input line-entry system

The invention of computers opened new possibilities for the storage of the chemical structures of chemical compounds. For this reason, during the 20th century different systems were studied with the aim of efficiently storing compound structures, avoiding redundancy and providing an easy and exchangeable data format. During the 80s, the Simplified molecular-input line-entry system (SMILES) was invented [59], [37]. This method aims to represent chemical compounds in the form of unique and unambiguous ASCII strings [59]. SMILES representations which map each chemical compound to a unique representation are generally called canonical SMILES. They generally depend on the software or canonicalization algorithm used. Indeed, chemical compounds can have multiple SMILES representations and different software packages have different strategies to select a specific one. A general approach is to use the CANON algorithm to prioritize the atoms in the given compound and subsequently write the SMILES atom by atom following the order defined by a depth-first traversal of the molecular graph [55, 420-432]. This prioritization guarantees the uniqueness of the generated representation.


In this representation, chemical elements are generally represented by their chemical symbols enclosed in square brackets - unless they are part of the organic subset defined as {N, C, B, O, P, F, S, Cl, I, Br} - and the adjacent characters represent the atoms to which they are connected. Side chains are represented using round brackets and bonds by the set {-, =, #, $}, in which the symbols stand for single, double, triple and quadruple bonds respectively. Partial and ionic bonds are depicted as {:, .}. Chirality is expressed by the "@" symbol, indicating that the following neighbors are arranged anticlockwise around the chiral center; when they are arranged clockwise, the "@@" symbol is used. Configuration around double bonds (cis and trans) is specified by the directional bond symbols "\" and "/", and aromaticity is represented by lowercase symbols, with the opening and closing points of the ring marked by a number, i.e. "c1ccccc1" [55, 420-432], [59], [37].
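As a small illustration of canonicalization (a sketch assuming the RDKit package, not part of the original work), the same molecule written as two different SMILES strings is mapped to a single canonical form:

from rdkit import Chem

# Two valid spellings of ethanol; RDKit maps both to one canonical SMILES.
for s in ["OCC", "CCO"]:
    mol = Chem.MolFromSmiles(s)
    print(s, "->", Chem.MolToSmiles(mol))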

2.2 Molecular fingerprints

Another way to represent molecules is the use of molecular fingerprints. Molecular fingerprints consist of vectors of features representing molecules, which are unique representations of the given chemical compounds [37]. There exist several types of molecular fingerprints, which differ in the way they calculate features and store them. They can generally be divided into binary fingerprints and count fingerprints [37]. The first ones are binary vectors, also called bit vectors or boolean vectors. Each one in such a vector represents the presence of a specific feature in the chemical compound. The second ones store not only the features, but also the number of times they are found in a chemical compound. This is a dramatic difference in comparison with the first ones. Indeed, with the latter, there is less


possibility of ambiguity between fingerprints: two compounds having the same features in different quantities are mapped to the same fingerprint using the first method and to different ones using the second method. There are different implementations, which generally use bit vectors or lists to store features. In the first case, for each feature a one is added at the position of the vector representing the given feature. In the second one, the position of the present feature is stored in a list. The first method, on the one hand, allows fast computations, which are especially desired in similarity search tasks; on the other hand, it can be inefficient when the number of features owned by a chemical compound is low, due to the extremely sparse vectors that have to be stored. The second one may not be as fast, but it is an efficient implementation in terms of memory usage. Sparsity can be reduced through the "folding" process, which consists of dividing a bit vector into two symmetric parts and merging them with a logical OR operation. This method allows the compression of molecular fingerprints to bit vectors of the desired size and ratio of ones over zeros. However, as the vector size decreases, the risk of feature collisions increases, which may result in an undesired loss of information [55, 441-447]. As mentioned before, there exist several types of molecular fingerprints, which depend on the way features are calculated. Extended connectivity fingerprints (ECFP), Chemical Hashed Fingerprints and Pharmacophore Fingerprints are the most used ones for different purposes, spanning from sub-structure searching to structure similarity and QSAR or QSPR experiments [3], [1], [13].


Figure 2.1: ECFP generation process. For each atom, identifiers are calculated considering the neighborhood. The diameter 0, 1, 2 etc. refers to the distance in terms of bonds separating two atoms along the shortest path connecting them. The calculated identifiers are mapped to a specific position in a bit vector through a hashing function. Bit collision happens when two different identifiers are mapped to the same position in the binary vector. Figure source: [13]

Extended connectivity fingerprints

Extended connectivity fingerprints (ECFP) are circular topological fingerprints. The neighborhood of each atom in a chemical compound is used to calculate an integer, which defines the position of the represented feature in a binary vector. The neighborhood is defined in terms of the number of bonds lying on the shortest path connecting two nodes in the molecular graph representing a chemical compound. So, the neighborhood of diameter one


around an atom corresponds to all atoms that are directly connected to it, the neighborhood of diameter two is represented by the atoms directly connected to the neighborhood of diameter one, and so on. In this way, different ECFPs are defined, which consider different diameters. The diameter used is generally specified by an integer written after the acronym ECFP; so, ECFP2 denotes ECFP using a neighborhood of diameter 2, ECFP4 one using a neighborhood of diameter 4, and so on. The integer is calculated starting at the central atom through the application of a hashing function, which can differ from implementation to implementation, to its specific properties. This integer is subsequently combined with those of the neighbors to calculate the integer for the neighborhood, and the process is repeated recursively until the desired diameter is reached. The integers of each neighborhood define identifiers representing features, which are mapped to specific positions in the bit vector representing the molecular fingerprint. In this way, the presence or absence of a specific one in the bit vector corresponds to the presence or absence of a specific feature in the analyzed molecule. For this reason, ECFP are useful in structure similarity search and less interesting for substructure search, which is therefore accomplished using Chemical Hashed Fingerprints [13].
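A minimal sketch of how such a fingerprint can be obtained in practice (assuming the RDKit package; the molecule is only a placeholder) uses the Morgan algorithm, whose radius of 2 corresponds to a diameter of 4, i.e. the ECFP4 analogue:

from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles("c1ccccc1O")   # phenol, used here only as an example
# radius=1 (diameter 2) and radius=2 (diameter 4) capture growing neighborhoods
for radius in (1, 2):
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=2048)
    print("radius", radius, "->", fp.GetNumOnBits(), "bits set out of", fp.GetNumBits())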

2.3 ChEMBL

ChEMBL is a database storing information about drug-like molecules [16]. This information is generally manually extracted from scientific publications and undergoes multiple checks and normalization processes to ensure


Figure 2.2: In this figure the composition of ChEMBL version 23 is depicted. The size of the circles is proportional to the number of activities and the names linked to each circle represent the bioassay data sources.

the correctness of the chemical structures, avoid redundancy and ensure consistency in representation. During this process, structures are checked for potential problems, different names referring to the same chemical compound are linked to one another, charges are neutralized and the units of measure are standardized [16]. The retrieved information involves absorption, distribution, metabolism, excretion and toxicity properties (ADMET), functional assays and binding assays, as well as the chemical structures of the compounds [16]. This makes ChEMBL an important data-set for many tasks in cheminformatics like QSPR experiments. In this master thesis version 23 of this data-set was used [6], which contains 1,735,442 compounds, 14,675,320 activities, 1,302,147 assays, 11,538 targets and 67,722 documents.


2.4 Data-sets preparation

In the previous sub-chapter, the ChEMBL data-set was introduced. In this master thesis, in order to train both generative adversarial networks, Chemo-GAN and Latent-Space-GAN, the 23rd version of ChEMBL was used. The data-set was downloaded in Structure-Data File (SDF) format. This type of format contains molecules in MDL format delimited by "$$$$" [11], [10]. This is not only used to include more MDL files in a single SDF, but also to include further metadata, i.e. molecular properties like the molecular weight or the molecular ID. MDL files are molecular representations whose structure can be divided into four parts. The first one is the header, which contains the name of the molecule and information concerning the program that generated the MDL file [10]. The second part contains information about the atoms, i.e. the spatial coordinates x, y, z and the type of element [11]; therefore, this part is generally called the atom block. Each line in the third part describes a bond between two atoms, and this part is therefore called the bond block [11]. The last part of the MDL file contains further information about molecular properties [10]. To obtain a data-set which could be fed into the generative adversarial network, the SMILES representation of each molecule contained in the SDF file had to be extracted. This process was carried out using the RDKit package in Python, which provides the function MolToSmiles() for this purpose [53]. From this point on, two different work-flows were used to generate the data-sets containing molecular fingerprints and SMILES strings, which were respectively used to train the Chemo-GAN and the Latent-Space-GAN.


Table 2.1 Examples of data-set used for the Chemo-GAN

SMILES strings 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18

0 CCCCOC(=O)CSc1nnc(-c2cc(OC)c(OC)c(OC)c2)o1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0

1 CCCCc1cc(O)c(CCCC)c(O)c1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

2 O=C(Nc1cc(N2CCOCC2)ncn1)c1ccccc1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

3 O=C(c1cccc(F)c1)N1CCCC2(CCN(C(c3ccccc3)c3ccccc... 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1

4 Cc1noc(C)c1C(=O)N1CCC2(CCCN(C(c3ccccc3)c3ccccc... 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0

5 O=C(c1ccncc1)N1CCC2(CCCN(C(c3ccccc3)c3ccccc3)C... 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0

6 O=C(c1cnccn1)N1CCC2(CCCN(C(c3ccccc3)c3ccccc3)C... 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0

7 O=C(NCCN1CCOCC1)c1cc(-c2cccs2)on1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0

8 CCCc1[nH]nc2c1C(C1CCCCC1)C(C#N)=C(N)O2 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0

9 O=C1CC(c2ccco2)CC2=C1C1c3ccccc3C(=O)N1c1ccccc1N2 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Table 2.1: Example of the data-set used for the training of the Chemo-GAN. Each row represents a chemical compound. The first column of each row represents the index and the second one the SMILES string, while the remaining columns (only the first 19 of the 2,048 fingerprint bits are shown) represent the molecular fingerprint.

Data-set generation for Chemo-GAN

To train the Chemo-GAN, the SMILES strings extracted from the SDF file downloaded from ChEMBL were converted into molecular fingerprints. To achieve this, the RDKit package was used. Specifically, Morgan fingerprints were calculated through the function GetMorganFingerprintAsBitVect(). This function uses a variant of the Morgan algorithm to calculate this type of molecular fingerprints, which are topological fingerprints comparable to the ECFP. This implementation is described in [45] and is very similar to the CANON algorithm described in chapter 2.1. In this case, only SMILES strings with a length lower than or equal to 40 were considered. This resulted in a data-set of 382,688 molecules after the removal of invalid structures and duplicates. For these SMILES strings, molecular fingerprints with a diameter of 4 were calculated and folded to 2,048 bits.
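The following sketch summarizes this work-flow under the assumptions that RDKit is available and that the ChEMBL SDF file is stored locally as "chembl_23.sdf" (a placeholder file name); it illustrates the described steps and is not the original preparation script.

from rdkit import Chem
from rdkit.Chem import AllChem
import numpy as np

fingerprints, seen = [], set()
for mol in Chem.SDMolSupplier("chembl_23.sdf"):
    if mol is None:                      # skip structures RDKit cannot parse
        continue
    smi = Chem.MolToSmiles(mol)
    if len(smi) > 40 or smi in seen:     # length filter and duplicate removal
        continue
    seen.add(smi)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)  # diameter 4
    fingerprints.append(np.array(list(fp), dtype=np.int8))

X = np.stack(fingerprints)               # training matrix for the Chemo-GAN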


Data-set generation for Latent-Space-GAN

To train the Latent-Space-GAN, SMILES strings were converted into numbers. This process was carried out at the character level. Only SMILES strings shorter than 40 characters were used. From this new data-set containing 382,754 chemical compounds, duplicates were removed, giving rise to a data-set of 382,748 SMILES strings. These SMILES strings were checked for structural problems with the RDKit package. After this check, a final data-set containing 382,688 chemical compounds was obtained. 53 different characters were found to be present in the data-set, but most of them had a low frequency. For this reason, and to speed the process up, a different number was assigned to each of the 15 most commonly encountered characters, while all the others were encoded with a single shared number. A dictionary was built to map the characters present in the SMILES strings to numbers and vice versa. At the end of each sequence an end-of-sequence symbol ("|") was added. Furthermore, sequences having a length shorter than 40 characters were padded with zeros to obtain a data-set of sequences of the same length.
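A minimal sketch of this encoding is shown below; the list of the 15 most common characters is an assumption made only for illustration (the real set was derived from the character frequencies of the data-set), while the codes follow the convention used here: 1-15 for common characters, 16 for all remaining ones, 17 for the end-of-sequence symbol and 0 for padding.

MAX_LEN = 40
common = ['C', 'c', '(', ')', '1', '2', 'O', 'N', '=', '3', 'n', '[', ']', 'F', 'S']  # assumed order
char_to_int = {ch: i + 1 for i, ch in enumerate(common)}
RARE, END, PAD = 16, 17, 0

def encode(smiles):
    # map each character to its integer code, append the end symbol, pad to MAX_LEN
    codes = [char_to_int.get(ch, RARE) for ch in smiles] + [END]
    return codes + [PAD] * (MAX_LEN - len(codes))

print(encode("c1ccccc1O"))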

2.5 Evaluation

The quality of the Chemo-GAN and the Latent-Space-GAN was evaluated through a new distance measure, which was called Fréchet Tox21 Distance (FTOXD). Furthermore, to detect possible mode collapses, the Tanimoto


Figure 2.3: This figure represents the sequence length distribution of SMILES strings present in the data-set, expressed as a percentage.

Figure 2.4: This figure represents the number of times each character appears in the data-set, expressed as a percentage.


Table 2.2 Examples of data-set used for the training of the Latent-Space-GAN

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39

0 1 5 3 1 8 9 1 3 7 10 2 3 7 4 7 4 16 1 8 4 1 9 3 1 3 1 5 4 2 4 2 17 0 0 0 0 0 0 0 0

1 2 3 10 7 4 3 2 1 5 1 1 1 3 1 1 5 4 6 4 1 5 1 3 1 1 3 1 3 1 5 6 4 6 2 4 6 4 6 17 0

2 14 9 16 13 5 1 8 1 1 1 3 1 1 8 1 1 5 2 3 10 6 4 6 2 2 4 2 3 10 6 4 6 17 0 0 0 0 0 0 0

3 2 3 10 6 4 3 1 5 1 1 1 3 1 1 5 4 16 4 7 6 17 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

4 9 5 1 3 7 2 4 1 8 1 3 9 3 1 9 8 4 2 4 1 8 1 5 16 1 3 9 8 4 15 2 17 0 0 0 0 0 0 0 0

Table 2.2: Example of the data-set used for the training of the Latent-Space-GAN. Each row represents a SMILES string. The fifteen most common characters are encoded with integer numbers from one to fifteen. The uncommon characters are encoded with the number sixteen. The number seventeen encodes the end of the sequence and zero is used for the padding.

coefficient was used. In the case of the Latent-Space-GAN we also calculated the percentage of valid generated SMILES strings. While the aim of the FTOXD is the assessment of distribution similarity, the Tanimoto coefficient was used to detect possible mode collapses. Indeed, without a measure to assess this, models could be considered good although they always produce the same SMILES string, due to the high score given to the produced molecules. The use of the FTOXD was inspired by the Fréchet Inception Distance used by Martin Heusel and coworkers in [25]. In the following sub-chapters, these measures are introduced.

2.5.1 Tanimoto coefficient

The Tanimoto coefficient was defined for the first time in 1960 [46] as a possible measure for plant similarity. In this context, it was used to express the similarity of plants in terms of the shared and distinct features shown by a plant. A set of features can be represented as a bit vector with one entry per feature, in which each entry equal to one represents the presence of the feature encoded at that position in the considered chemical compound, as explained in chapter 2.2 for molecular fingerprints. Using bit vectors, the


Tanimoto coefficient is expressed as:

T_s = \frac{\sum_i (X_i \wedge Y_i)}{\sum_i (X_i \vee Y_i)}    (2.1)

where ∧ and ∨ represent the logical AND and logical OR operators respectively. The Tanimoto coefficient was used to check how similar the generated fingerprints are to one another. This coefficient was measured between 100 pairs of generated fingerprints every 500 updates.
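A minimal sketch of Equation 2.1 on binary fingerprints, using plain Python lists of 0/1 values (the example vectors are placeholders):

def tanimoto(x, y):
    both = sum(1 for a, b in zip(x, y) if a and b)   # bits set in both vectors
    either = sum(1 for a, b in zip(x, y) if a or b)  # bits set in at least one vector
    return both / either if either else 1.0

print(tanimoto([1, 0, 1, 1], [1, 1, 0, 1]))          # prints 0.5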

2.5.2 Fréchet inception distance

The Fréchet inception distance (FID), which was used by Martin Heusel and coworkers in [25], improved the Inception Score described in [48], providing a measure of the difference between the data distribution and the generated one. The Inception Score is defined as:

Inception Score = \exp\left( \mathbb{E}_x \left[ KL\left( p(y|x) \,\|\, p(y) \right) \right] \right)    (2.2)

in which p(y|x) is the conditional distribution of the labels obtained as output of the Inception model when generated samples are fed into it and p(y) is the marginal distribution. In this case, x = G(z), in which G is the generator. This score expresses a quality evaluation of the generated samples, but it does not take into account the original data distribution, which could be different. Instead, for the calculation of the FID, both generated data and real data are fed into the Inception model and the output of some hidden layer is retrieved to obtain visually relevant features. In this way, two distributions are obtained, which are assumed to follow a multidimensional Gaussian distribution, because this is the maximum entropy distribution for a given mean and covariance. Finally, the


FID is calculated as the Fréchet distance between the first two moments of these two Gaussian distributions, as depicted in Equation 2.3.

FID = d^2\left((m, C), (m_w, C_w)\right) = \|m - m_w\|_2^2 + \mathrm{Tr}\left(C + C_w - 2\,(C C_w)^{1/2}\right)    (2.3)

2.5.3 Fréchet Tox21 Distance

To evaluate the Chemo-GAN and the Latent-Space-GAN, the Fréchet Tox21 Distance (FTOXD) was designed. This metric is similar to the FID, but instead of using the Inception model to generate the conditional probability p(y|G(z)), it uses a model trained on the Tox21 data-set, which was called the Tox21-FTOXD model. The Tox21 data-set, which consists of 12,000 training samples and 647 test ones, was used. From the chemical structures contained in this data-set, the equivalent of the ECFP4 molecular fingerprints was calculated for each compound through the RDKit package in Python. The Tox21-FTOXD model was designed as a 3-layer fully connected neural network with 1,024 units in each hidden layer and 12 outputs corresponding to the different labels. In each hidden layer, the selu activation function with the LeCun weight initialization was used. Many labels provided in the Tox21 data-set were missing; therefore, missing values were masked during the training. The binary cross-entropy was used as loss function. This model obtained an AUC of 0.74 on the test set. Generated fingerprints were fed into this model and the outputs were extracted from its second hidden layer to obtain chemically relevant features. Molecular fingerprints derived from SMILES strings belonging to the test set of the ChEMBL data-set were used to calculate the conditional distribution p(y|x_data), while the conditional distribution p(y|G(z)) was calculated with the generated molecular fingerprints. Formula 2.3 was used to calculate


the FTOXD using the means and covariance matrices derived from the distributions obtained with the Tox21-FTOXD model.
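The following sketch shows how Equation 2.3 can be applied to hidden-layer activations, as done for the FTOXD; it is an illustration assuming NumPy and SciPy, where act_real and act_fake stand for the second-hidden-layer outputs of the Tox21-FTOXD model for real and generated fingerprints (here replaced by random placeholders).

import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(act_real, act_fake):
    mu1, mu2 = act_real.mean(axis=0), act_fake.mean(axis=0)
    c1 = np.cov(act_real, rowvar=False)
    c2 = np.cov(act_fake, rowvar=False)
    covmean = sqrtm(c1 @ c2)
    if np.iscomplexobj(covmean):          # numerical noise can produce tiny imaginary parts
        covmean = covmean.real
    return np.sum((mu1 - mu2) ** 2) + np.trace(c1 + c2 - 2.0 * covmean)

act_real = np.random.randn(500, 64)       # placeholder activations
act_fake = np.random.randn(500, 64)
print(frechet_distance(act_real, act_fake))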

2.6 Chemo-GAN

Chemo-GAN was the first method implemented and analyzed in this master thesis, through which we tried to apply generative adversarial networks for the first time in chemoinformatics. It is a generative adversarial network composed of a generator and a discriminator, which were both implemented as fully connected multi-layer artificial neural networks. The architecture was implemented using Keras, which is a high-level API able to run on top of different open-source libraries such as Tensorflow and Theano [9]. For the implementation of all models, Keras was run on top of Tensorflow. The generator and the discriminator were connected, giving rise to the Chemo-GAN. To train the discriminator, a set of molecular fingerprints was randomly sampled from the fingerprint data-set and labeled with ones, and a second set of generated molecular fingerprints of the same size was sampled from the generator. In the context of generative adversarial networks, sampling from the generator is carried out by feeding the generator with a random vector sampled from a predefined prior, carrying out a forward pass through the network and retrieving the output of the last layer. In this method a Gaussian prior with µ = 0 and σ = 1 was used. For each update of the discriminator, it was first trained on real data labeled with ones and then on generated data labeled with zeros. The training of the generator was realized using the Chemo-GAN after having frozen the weights of the discriminator. Instead of optimizing the generator by descending


its stochastic gradient of J^{(G)}(θ^{(D)}, θ^{(G)}), the generator was trained by maximizing the log probability of the discriminator being mistaken, as suggested in [20], by flipping the labels given to the samples instead of the sign of the cost function. Also during the training of the generator, the weights of the discriminator were frozen. This different objective for the generator was used to avoid the vanishing of the gradient when the confidence of the discriminator becomes too high, as suggested in [20]. The learning process was monitored using the FTOXD measure between generated molecular fingerprints and the original data-set. To also monitor possible mode collapses, the Tanimoto coefficient between the generated molecules was calculated. These measures were taken every 500 updates of the models, where each update was carried out using a batch size of 10,000 samples. In a first phase, different architectures with different hyper-parameters were trained. A common structure was used for all models: both generator and discriminator have one hidden layer. During the tuning of the parameters, hidden layers were added to both generator and discriminator. All generators were implemented as fully connected feed-forward neural networks taking vectors of size 2,048 as input and generating vectors of size 2,048; their last layer used a sigmoid activation function. All discriminators were implemented as fully connected feed-forward neural networks taking vectors of size 2,048 as input and generating a prediction in the range zero to one through a sigmoid activation function. Each generative adversarial network used the binary cross-entropy as loss function. The number of hidden layers, the activation functions for the hidden layers and the learning rates of generator and discriminator were tuned. Stochastic gradient descent was used as optimization method.
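A compact Keras sketch of this training scheme is given below (fully connected generator and discriminator, label flipping for the generator update); the layer sizes, learning rates and batch size are placeholders rather than the tuned values reported in the results.

import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD

# generator: noise vector of size 2,048 -> fingerprint-like vector of size 2,048
G = Sequential([Dense(2048, activation='relu', input_dim=2048),
                Dense(2048, activation='sigmoid')])

# discriminator: fingerprint vector -> probability of being real
D = Sequential([Dense(2048, activation='relu', input_dim=2048),
                Dense(1, activation='sigmoid')])
D.compile(loss='binary_crossentropy', optimizer=SGD(lr=0.01))

D.trainable = False                          # freeze D inside the combined model
gan = Sequential([G, D])
gan.compile(loss='binary_crossentropy', optimizer=SGD(lr=0.01))

def train_step(real_fps, batch_size=64):
    noise = np.random.normal(0.0, 1.0, (batch_size, 2048))
    fake_fps = G.predict(noise)
    D.trainable = True
    D.train_on_batch(real_fps[:batch_size], np.ones(batch_size))   # real samples -> 1
    D.train_on_batch(fake_fps, np.zeros(batch_size))               # generated samples -> 0
    D.trainable = False
    # generator update through the combined model: generated samples labeled as real
    gan.train_on_batch(np.random.normal(0.0, 1.0, (batch_size, 2048)), np.ones(batch_size))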


Figure 2.5: This figure represents the Latent-Space-GAN. Original data are encoded in a latent space through an encoder (ENC), while fake samples are obtained through the generator (G), which maps Gaussian noise to the same latent space representation. This latent space representation is subsequently used for the training of the GAN composed of G and D. Generated samples are decoded to the SMILES encoding system through a decoder network.

2.7 Latent-Space-GAN

The second part of this master thesis was focused on the implementation of the Latent-Space-GAN. The name derives from the fact that this generative adversarial network learns how to produce a latent space representation of SMILES strings. Basically, the Latent-Space-GAN is composed of four mod-


els: a generator, a discriminator, an encoder and a decoder. The encoder part is used in the first phase of the training to project real data (SMILES strings) into the latent space. These data are used as real samples during the training of the generative adversarial network. During the training process, both fake samples, which are sampled from the generator, and real encoded ones are fed into the discriminator, whose task is to provide feedback in the form of a gradient to the generator, helping it, in this way, to improve itself. Generated samples are finally mapped back from the latent space to the SMILES space through the decoder part of the model. During the data-set generation, uncommon characters were encoded with the same number, in this case sixteen. For this reason, the generator produces SMILES strings containing from time to time some sixteens, which need to be replaced by one of the 35 possible uncommon characters before the evaluation of the model. To assign a character to each sixteen, another model was used. This model, which was called the "corrector", replaced every sixteen with a character whose selection is context-based; it was implemented as a stacked LSTM. The Latent-Space-GAN was trained in two stages: firstly, the auto-encoder was trained, and subsequently the generator and discriminator were trained. The quality of the model was measured with the FTOXD and the percentage of valid generated SMILES strings.

2.7.1 Auto-encoder

The auto-encoder is a neural network which tries to learn the identity function under certain constraints that hinder this process, e.g. noise addition, dropout or hidden layer size reduction. This yields an efficient


representation of the data due to the ability of the auto-encoder to exploit patterns. It is composed of an encoder and a decoder network. Both were implemented using recurrent neural networks, which allow the model to encode information concerning the context in which characters are placed through the sharing of parameters. To enable the model to "perceive" the context at both sides of a character, bidirectional layers were used. A bidirectional layer is composed of two recurrent neural networks which read the sequence starting from opposite ends and whose weights are used at the same time to calculate the values of the neurons of the next layer. Instead of using simple recurrent neurons, LSTM cells were used, which do not incur the vanishing of the gradient and allow information to be added to or removed from the cell state. LSTMs have been successfully applied to many problems belonging to different fields, spanning from speech recognition to chemistry, showing the ability to learn dependencies between distant elements and to drop useless information. On top of the bidirectional LSTM layers, further LSTM layers were stacked, which should provide the model with a further level of abstraction in which more complex features can be learned. Gaussian noise was introduced into the model to force it to learn and generalize better, and for the same reason dropout was also used in each layer. In this model, data are embedded through an embedding layer and subsequently fed into the bidirectional LSTM layer. The last layer of the architecture applies a softmax activation function to each time step of the sequence retrieved from the previous layer, defining in this way a distribution over the characters at each time step. The predicted sequence was obtained from the last LSTM layer and the characters were retrieved by taking the argmax at each time step of these predictions, which represents the character to which the model assigns the highest probability. This is represented in Formula 2.4, in which


L is the index of the last layer, x^{L-1} ∈ R^{18} (18 is the dimension of the used alphabet) and i indexes the i-th output of the model.

prediction = \arg\max_i \left( \mathrm{softmax}(x^{L-1})_i \right)    (2.4)

The weights of the model were updated using the RMSprop optimizer described in chapter 1.2.2 and the categorical cross-entropy was used as loss function. Two different latent space representations of the model were tried. In one case the values of the vector were constrained between zero and one through the sigmoid function, and this representation was called the "sigmoidal latent space", while in the second one a linear activation was applied and the representation was called the "linear latent space". Due to the time needed to train the model, no cross-validation was carried out, but a hold-out data-set corresponding to 35% of the data was left out for testing. This was done to guarantee an unbiased estimation of the prediction score on future data. The rest was divided into a training set (80% of the data) and a validation set (20%). The models were trained on the training set and validated on the validation set to select the best parameters. The performances of these models on the validation set are reported in Figures 2.6 and 2.7. The models that achieved the best performances were trained for 2,000 epochs and tested on the test set. Both architectures achieved an accuracy above 90% on the test set.
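A rough Keras sketch of such an auto-encoder is given below (embedding layer, bidirectional LSTM encoder, LSTM decoder with a per-time-step softmax); layer sizes, noise level, dropout rates and the latent dimension are placeholders, and the sparse categorical cross-entropy is used here so that integer-encoded sequences can serve directly as targets.

from keras.models import Model
from keras.layers import (Input, Embedding, GaussianNoise, Bidirectional,
                          LSTM, Dense, RepeatVector, TimeDistributed)

MAX_LEN, VOCAB, LATENT = 40, 18, 128       # sequence length, alphabet size, latent size

inp = Input(shape=(MAX_LEN,))
x = Embedding(VOCAB, 64, input_length=MAX_LEN)(inp)
x = GaussianNoise(0.05)(x)                                  # noise to regularize the model
x = Bidirectional(LSTM(256, return_sequences=True, dropout=0.3))(x)
x = LSTM(256, dropout=0.3)(x)
latent = Dense(LATENT, activation='sigmoid')(x)             # the "sigmoidal latent space"

y = RepeatVector(MAX_LEN)(latent)                           # decoder: latent vector -> sequence
y = LSTM(256, return_sequences=True, dropout=0.3)(y)
out = TimeDistributed(Dense(VOCAB, activation='softmax'))(y)

autoencoder = Model(inp, out)
autoencoder.compile(optimizer='rmsprop', loss='sparse_categorical_crossentropy')
autoencoder.summary()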

2.7.2 Generator and Discriminator

The generator and the discriminator were implemented using fully connected hidden layers with the selu activation function in each hidden layer. They were trained in a second step, after the selection of the auto-encoders.


Figure 2.6: Accuracy of the auto-encoders using the linear activation function to calculate the latent space in reconstructing the original SMILES strings. Each row of the figure represents the accuracy of a specific model, identified by the name on the y axis, over the training epochs. The color encodes the value of the accuracy at each epoch; the darker the color, the lower the accuracy.

The generator was used to map Gaussian noise to a vector of continuous values of the same size as the encoder output vector. The same activation functions used to calculate the output of the encoders were used to calculate the output of the generator; a generator with a linear output and one with a sigmoidal output were thus obtained. Dropout was used as regulariza-


Figure 2.7: Accuracy of the auto-encoders using the sigmoid activation function to calculate the latent space in reconstructing the original SMILES strings. Each row of the figure represents the accuracy of a specific model, identified by the name on the y axis, over the training epochs. The color encodes the value of the accuracy at each epoch; the darker the color, the lower the accuracy.

tion in each layer. Gaussian noise was added at each layer of the generator and applied to the input of the discriminator. Indeed, as suggested in [51] and [4], the instability of generative adversarial networks could be caused by non-overlapping supports of the generator and discriminator functions, which may lead to the possible presence of multiple optimal discriminators


and so to the invalidity of the convergence proof proposed by Goodfellow and coworkers in [21]. The noise addition should push the supports to overlap better and reduce instability problems. For the same reason, label switching was also used, which flips the class labels after a predefined number of epochs. The model was optimized using the Adam and SGD optimization methods and the binary cross-entropy was used to measure the loss of the model. The learning rate was decreased every 25 updates. During the training of the generator, the weights of the discriminator were frozen. The discriminator was trained multiple times if its loss exceeded 0.5; this was done to keep the discriminator near optimal and able to provide helpful feedback to the generator. In this case, as in the training of the Chemo-GAN, the optimized objective is the one defined in Equation 1.36. During the optimization of such an objective, the only moment in which the gradient vanishes is when the generator manages to make the discriminator make mistakes. Generally, this is not a real problem if the discriminator is near optimal, because by the time the generator manages to fool the discriminator the quality of the generated samples is already good, as mentioned also in [20]. Both models, the one with the linear latent space and the one with the sigmoidal latent space, were trained for 24,500 updates of 100,000 samples randomly drawn from the original data and 100,000 generated ones. From the first computations, auto-encoders having an accuracy above 90% were obtained. Despite these results, the amount of generated valid SMILES strings turned out to be below 0.1% for the Latent-Space-GAN using the sigmoidal latent space and below 0.2% for the one using the linear latent space. This can be observed in Figure 2.8, in which the percentage of generated valid SMILES strings is represented for models saved at different points during the training of the Latent-Space-GAN using the linear latent space.


Figure 2.8: Distribution of the percentage of valid SMILES strings generated by generators saved along the training of the Latent-Space-GAN with linear latent space. Each generator was used to generate 20,000 SMILES strings 10 times; after each generation, the percentage of valid SMILES strings was measured and the values obtained by each generator are summarized with a boxplot.

At visual inspection, it was observed that the quality of the generated samples improves during training and that the invalidity of SMILES strings is often caused by missing parentheses or unclosed rings, as can be observed in Table 2.3. Nevertheless, given these performances, we did not continue further with the study of the Latent-Space-GAN.
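The validity of a generated string can be checked with RDKit, which fails to parse syntactically invalid SMILES; a minimal sketch (the example strings are placeholders) of how the percentage of valid SMILES strings can be computed is:

from rdkit import Chem

def percent_valid(smiles_list):
    # MolFromSmiles returns None for strings it cannot parse, e.g. unclosed rings
    valid = sum(1 for s in smiles_list if Chem.MolFromSmiles(s) is not None)
    return 100.0 * valid / len(smiles_list)

print(percent_valid(["c1ccccc1", "C1CC", "CC(=O)O"]))   # "C1CC" has an unclosed ring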


Table 2.3 Comparison between original and generated SMILES strings

Original SMILES strings Generated SMILES strings

0 c1(c2nc(N=C(N)N)sc2)cn(c(c1)C)C S.O.CC(N(F)/CCCCNCC322oc1nnncc31)F

1 C(=N)(Cc1ccc(cc1)O)c1c(cc(c(c1O)OC)O)O O=C2N(nC21CN2SN=C/C=NC[N-])OCC=CC2C=C1

2 [nH]1c2ccc(cc2cc1C(=O)OCC)C(=O)O N(CNNOC1cc1/CO/C \OC(CF)/CP(CO)OC)O

3 C(=O)(c1ccc(cc1)I)NO n1(N(C2(OC21)CCO)C)C/1OC1CCC.C(C)O

4 n1c(NC)c2c(n(cn2)C)c2c1sc(n2)SC CN=NC(NN2)C.N1C=CN1OS3=nNC1=CCCCC3C2s1

5 N1(C(=O)/C=C/C=C(/CCC=C(C)C)\C)CCCC1 n12n(ncc(n2)/N)C.Sc2=NN2N=C/1F.[O-]C

6 C(=S)(Nc1ccc2c(c1)C(=O)OC2)Nc1cccc(c1)C P1O.C2CCOCP12NNNN=C1/NCCCC#CCC#CCCC1

7 c1(cnnn1c1c(cc(cc1Cl)C(F)(F)F)Cl)CCC B.FB(CC(ON2C.N1C2Nc1O1)(CCCC1)/C)O

8 c1(c2nc(NC(=O)C(=O)O)sc2)cc(no1)Cl C1#CC(OC(C#C)C/1C#CNCCl)N.NC(F)C=O

9 n1c(nc2c(c1NCC(CC)C)NCN2Cc1ccccc1)C#N P1OOCB(F.Br)N(/CC#COC(CCCN=C22)n=C1S)C

10 c1c(c(ccc1OC)CCc1ccc(c(c1)C(=O)OC)O)OC n1(NCC2=N \CCc12cnnnnnnc2)/C21CCNCnnn1C

11 c1ccc(c(c1)C1=NOC(O1)(C)c1ccccn1)Cl S1(N(N=CC(=C(CC1)CCCCCCCF)/CC)F)(F)C

12 C1(=NCCN1)Cc1cc2c(cc1)cccc2 c1(onc(c1C.O1)COC)C#CC2NN=C33c3312

13 c1(cccc(c1C(=O)O)CCCCCCC/C=C \CCCCCC)O O1C4=C(Br)CCN1C(SNc4/C(C)(C)C)(C)N

Table 2.3: Comparison between original and generated SMILES strings. The first column represents the index, the second the SMILES representation of chemical compounds belonging to the ChEMBL data-set and the third the generated SMILES strings.


3. Results

3.1 Results Chemo-GAN

In this first approach, the Chemo-GAN was successfully trained to generate molecular fingerprints which look as if they were sampled from the original data distribution. The similarity between the two distributions was measured through the FTOXD, a newly defined metric which offers the possibility to measure the distance between the data distribution and the one represented by the generator using highly relevant chemical features retrieved through the Tox21 model. This generative adversarial network was implemented as a fully connected artificial neural network, and parameters such as the learning rate, the number of hidden layers and the activation functions were tuned to obtain better performances. In Figure 3.1 the results obtained by different models using different numbers of hidden layers and activation functions can be observed, together with the Tanimoto coefficient for each model. On the y axis the names of the models with the respective parameters are shown. Each name is composed of: the number of hidden layers added to the general structure of the generator, the number of hidden layers added to the general structure of the discriminator, the learning rates used to update the generator and the discriminator, the number of updates


Figure 3.1: FTOXD and Tanimoto coefficient calculated for each model trained. The Tanimoto coefficient was measured between all pairs of 500 generated molecular fingerprints per model and these distributions are represented by box-plots. The FTOXD was calculated 50 times per model, using 10,000 generated molecular fingerprints each time, and the 50 measures are likewise summarized using box-plots.


and finally the activation function used in the hidden layers. For example, the name 1_1_0.01_0.01_10000_tanh stands for a GAN with a generator and a discriminator each having one extra hidden layer, trained using a learning rate of 0.01 for both the generator and the discriminator for 10,000 updates, where each hidden layer used the tanh activation function. From Figure 3.1 the importance of the number of hidden layers used in the discriminator part of the network can be easily observed. Indeed, the boxplots representing the results obtained after repeated generation of molecular fingerprints and their quality assessment using the FTOXD can be clustered into three groups depending on the number of extra hidden layers used in the discriminator part of the model. In the upper left corner, the models using three extra hidden layers in the discriminator obtained the best results, which are similar to one another but far better than the results obtained by models using fewer hidden layers in the discriminator. A second and a third group can be discerned, which contain models having an FTOXD in the range 300-600 and 800-900 respectively. The second group was implemented using two hidden layers in the discriminator, while the third one was implemented using only one hidden layer. It is also interesting that the Tanimoto coefficients are in general low, suggesting that the models are generating different molecular fingerprints, with a higher variance and a more left-skewed distribution of the FTOXD for the models belonging to the first group. To better appreciate the differences between models within each group, the results of each group were represented in different plots, as can be observed in Figure 3.2. From these plots, the effect of using different activation functions can be observed. This is especially true for the models represented in the second row, in which the differences in FTOXD are more pronounced.


Figure 3.2: FTOXD and Tanimoto coefficient for models belonging to the same group, represented in the same plot. The first row contains the results of the models belonging to the third group, the second row those belonging to the second one and the third row those belonging to the first one.


Figure 3.3: This figure represents the distributions of the FTOXD measured every 500 updates for the Chemo-GAN architecture that obtained the best results. Every 500 updates, 10,000 molecular fingerprints were generated 50 times and the FTOXD was calculated. The distributions of these values are represented as boxplots.

In the first row it can be seen that the model using the sigmoid activation function achieved the best results on average, while in the second row the best results were achieved by the model using the relu activation function and in the last row by the model using the elu activation function. It is also interesting to note that the distributions of the Tanimoto coefficient are similar within the first group and become more and more different as the number of hidden layers increases. In Figure 3.4 the binary accuracy, the loss of the generator and that of the discriminator are depicted for the model using the elu activation function of each group. From this picture the behavior of the generator and discrim-


Figure 3.4: Learning curves of the discriminator and generator and accuracy of the discriminator calculated at each epoch of the training of different Chemo-GAN architectures. From top to bottom, the plots show the results for Chemo-GAN models whose discriminator has 1, 2 and 3 extra hidden layers respectively, for the models using the elu activation function in the hidden layers.


Figure 3.5: Learning curves, accuracy and Tanimoto coefficient for the Chemo-GAN architecture which achieved the best results. First row: learning curves for the generator and discriminator. Second row: the first plot represents the binary accuracy of the discriminator in discriminating the sample classes during training, while the second one represents the Tanimoto coefficient measured between molecules generated after each epoch through the generator.


From these plots, the behavior of the generator and the discriminator during training can be easily perceived. At the beginning of the training, when the data distribution and the one represented by the generator are different, the losses of the generator and the discriminator are high; during training they decrease towards zero until a point is reached at which the discriminator is no longer able to discriminate the origin of the samples, as can be observed in the last plot of Figure 3.4 and in the first three plots of Figure 3.5. In these plots, at the last epochs, the accuracy approaches 0.5 and, while the loss of the generator tends to zero, the loss of the discriminator increases. In the first and second rows of Figure 3.4 the losses of the discriminator and the generator fell to zero, probably because the training was interrupted too early. This behavior is mirrored in the FTOXD plot shown in Figure 3.3: while at the beginning of the training this distance between the two distributions is high, it decreases during training, showing that the distributions become more and more similar. These data clearly show that the distribution approximated by the generator gets closer and closer to that of the original data during training. It can further be inferred that the depth of the discriminator plays a key role in the quality of the approximation achieved, as represented in Figure 3.1. Thus, the Chemo-GAN was successful in approximating the original data distribution, showing that generative adversarial networks can be a powerful tool also in chemoinformatics.
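As a reminder of why a binary accuracy near 0.5 marks the point at which the discriminator stops providing a useful signal, recall the optimal discriminator of the standard GAN objective [21]; this is a sketch under the original minimax formulation, which may differ in detail from the exact losses used here. For a fixed generator distribution $p_g$ the optimal discriminator is

\[ D^{*}(x) = \frac{p_{\mathrm{data}}(x)}{p_{\mathrm{data}}(x) + p_{g}(x)}, \]

so when the generator matches the data distribution, $p_g = p_{\mathrm{data}}$, every sample receives $D^{*}(x) = 1/2$, the binary accuracy settles at 0.5, and the discriminator's cross-entropy loss approaches $\log 4 \approx 1.386$.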


4. Discussion

The development of new drugs is a process that requires many years to be carried out. Libraries of chemical compounds are screened with the aim of finding molecules that present the right properties and the fewest side effects. The selected candidates must go through many phases before they are accepted, and these phases, which generally imply testing the candidates in the lab to assess potency, efficacy and toxicity, are highly expensive. With the aim of reducing costs and increasing the chances for candidates to reach the market, in-silico approaches have been used to assess properties or activities of chemical compounds and to generate new ones. Neural networks have been widely used, together with other methods such as random forests and support vector machines, in QSAR and QSPR experiments. Recurrent neural networks and variational auto-encoders have proven their ability to generate chemical compounds without the need for hand-crafted rules, covering in this way a wider chemical space. Since their introduction in 2014, GANs have become more and more popular and successful in a wide range of tasks spanning from music and images to art. In this thesis it was hypothesized that GANs could also be successful in the generation of molecular fingerprints and chemical compounds, and to prove it we implemented and applied, for the first time, models trained in an adversarial setting in chemoinformatics, with the aim of generating SMILES strings and molecular fingerprints.

To evaluate the quality of the generated samples and to measure the distance between the original data distribution and the generated data distribution, a new metric, the FTOXD, was defined. It provides a tool to evaluate generative models aiming to generate SMILES strings or molecular fingerprints, and it is especially useful for generative models such as GANs, for which the estimation of the log-likelihood is difficult [60]. The results obtained with the Chemo-GAN showed that the FTOXD decreases during training, suggesting that the data generating distribution and the learned one become more and more similar. This is also supported by the binary accuracy, represented in the bottom-left plot of Figure 3.5, which, after having reached a plateau around 1500 epochs, drops until it reaches 0.5 after 5000 epochs. When the accuracy is near 0.5 the discriminator is no longer able to provide useful feedback, because it cannot discriminate whether samples come from the generator or from the original data. Indeed, the FTOXD measure increases when the model is trained for more epochs.
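Concretely, the FTOXD follows the same recipe as the FID [25]: a Fréchet distance between Gaussian statistics (mean and covariance) of feature activations of real and generated samples, with the features taken from a Tox21-related model rather than from the Inception network. A minimal sketch of such a computation, assuming the activation matrices are already available (the function and variable names are illustrative, not the implementation used in this work):

    import numpy as np
    from scipy import linalg

    def frechet_distance(act_real, act_gen):
        # act_real, act_gen: arrays of shape (n_samples, n_features), e.g. hidden
        # activations of a Tox21-trained network for real and generated fingerprints.
        mu1, mu2 = act_real.mean(axis=0), act_gen.mean(axis=0)
        sigma1 = np.cov(act_real, rowvar=False)
        sigma2 = np.cov(act_gen, rowvar=False)
        diff = mu1 - mu2
        # matrix square root of the product of the two covariance matrices
        covmean, _ = linalg.sqrtm(sigma1.dot(sigma2), disp=False)
        if np.iscomplexobj(covmean):
            covmean = covmean.real  # drop tiny imaginary parts caused by numerical noise
        return diff.dot(diff) + np.trace(sigma1 + sigma2 - 2.0 * covmean)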


Figure 4.1: 100 generated chemical compounds sampled from a Gaussian prior with SD 0.4.

The same sigmoid-shaped curve was observed for all the trained models with all the different parameters. The only difference lay in the point at which the minimum is reached, which was lower for models using a deeper discriminator, as shown in Figure 3.1. The low Tanimoto coefficient further indicates that the model is producing molecular fingerprints that are not equal to one another and share few features. This may be due to the type of molecular fingerprints used: for ECFP4, features are calculated on the basis of the neighborhood of an atom within a diameter of 4 bonds, as explained in Chapter 2.2. It also suggests that no mode collapse happened. Although the model succeeded in approximating the original data distribution, it cannot directly generate molecular graphs; instead, chemical compounds must be "fished" from a library. This can be accomplished by measuring the Tanimoto coefficient between the generated molecular fingerprints and those present in a data-set, and retrieving the SMILES strings of the molecular fingerprints for which the Tanimoto coefficient is higher than a certain threshold.
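A rough sketch of this retrieval step, assuming the generated and library fingerprints are available as binary NumPy arrays; the function names and the similarity threshold of 0.7 are illustrative assumptions, not the exact procedure used in this work:

    import numpy as np

    def tanimoto(fp_a, fp_b):
        # Tanimoto coefficient between two binary fingerprint vectors.
        common = np.logical_and(fp_a, fp_b).sum()
        union = np.logical_or(fp_a, fp_b).sum()
        return common / union if union > 0 else 0.0

    def fish_compounds(generated_fps, library_fps, library_smiles, threshold=0.7):
        # Retrieve library SMILES whose fingerprint is similar to any generated fingerprint.
        hits = set()
        for gen_fp in generated_fps:
            for lib_fp, smiles in zip(library_fps, library_smiles):
                if tanimoto(gen_fp, lib_fp) >= threshold:
                    hits.add(smiles)
        return sorted(hits)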

The Latent-Space-GAN was designed to allow the direct generation of chemical graphs. In this case, the encoder part of the auto-encoder mapped the SMILES strings into a latent space, which is represented by a multidimensional vector and can be considered a sort of molecular fingerprint computed by a neural network. This latent space representation of SMILES strings was used to train a GAN. Two latent space representations of SMILES were used: one generated through a linear activation function and one generated through a sigmoid activation function. The better results of the model using the linear latent space could be explained by the fact that values in the sigmoidal latent space tend to be positioned at the corners of a hypercube, which hinders the backpropagation of the gradient and slows down the learning process.
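A small numerical illustration of this saturation effect (not taken from the trained models): the gradient of the sigmoid, s'(x) = s(x)(1 - s(x)), becomes vanishingly small once latent values sit near the corners of the hypercube, i.e. near 0 or 1:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    # Gradient of the sigmoid: s'(x) = s(x) * (1 - s(x)).
    for x in (0.0, 4.0, 8.0):
        s = sigmoid(x)
        print(f"x={x:4.1f}  sigmoid={s:.4f}  gradient={s * (1.0 - s):.6f}")
    # x= 0.0  sigmoid=0.5000  gradient=0.250000
    # x= 4.0  sigmoid=0.9820  gradient=0.017663
    # x= 8.0  sigmoid=0.9997  gradient=0.000335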

The low percentage of valid SMILES strings when mapping points in the latent space to SMILES representations is a problem already described in [19], and the reason, as the authors suggested, could be the fragility of the SMILES syntax. Indeed, as pointed out in the Methods chapter, many SMILES strings turned out to be invalid because of a missing parenthesis or open rings. Furthermore, the accuracy achieved by the auto-encoder reached 90% and, consequently, a further source of error was introduced into the model at this step. The binary accuracy was measured as the mean of the accuracy of each character predicted; therefore, one symbol can be mistakenly decoded into another, leading, in the case of a mistakenly decoded parenthesis or number, to a syntax error and the consequent invalidity of the SMILES string. Another source of error was introduced by the use of the corrector model to assign the unknown symbol "16" to the correct symbols based on context. The slight improvement of the percentage of valid SMILES strings shown in Figure 2.8, together with the better quality of the SMILES strings observed on visual inspection of those generated with a generator saved during the last epochs of training, suggests that longer training could have led to a further improvement. Despite these problems, the generated valid SMILES strings showed a low FTOXD, highlighting a similarity between the original data distribution and the one represented by the generator, and a decent quality. Therefore, it is possible to retrieve data-sets composed of only valid SMILES strings by filtering out the invalid ones using RDKit.
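A minimal sketch of such a filter using RDKit's standard parser (the helper name is illustrative); Chem.MolFromSmiles returns None for strings it cannot parse, e.g. because of an unbalanced parenthesis or an unclosed ring:

    from rdkit import Chem

    def filter_valid_smiles(smiles_list):
        # Keep only SMILES strings that RDKit can parse into a molecule.
        valid = []
        for smiles in smiles_list:
            mol = Chem.MolFromSmiles(smiles)  # None for syntactically invalid SMILES
            if mol is not None:
                valid.append(Chem.MolToSmiles(mol))  # store the canonical form
        return valid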

Finally, it was observed that the percentage of valid SMILES strings increases when the noise vectors are sampled around the mean of the Gaussian prior. This could be because these values are sampled more often, so the network has been trained on them more often and has become better at mapping them to valid latent space representations. Using this strategy, a higher percentage of valid SMILES strings is produced and data-sets containing only valid SMILES strings can be generated quickly. An example is shown in Figure 4.1, in which 100 valid SMILES strings were generated by sampling noise vectors with an SD of 0.4 in 5 minutes on a normal laptop, while the same amount can be generated in one minute using an SD of 0.2.
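A minimal sketch of this sampling strategy; the latent dimensionality and the generator/decoder calls in the usage comment are placeholders, not the exact Latent-Space-GAN interface:

    import numpy as np

    def sample_latent_points(n_samples, latent_dim, sd=0.4, seed=None):
        # Draw noise vectors from a zero-mean Gaussian prior with the chosen SD.
        # A smaller SD concentrates samples around the mean of the prior, the region
        # on which the generator was trained most often.
        rng = np.random.default_rng(seed)
        return rng.normal(loc=0.0, scale=sd, size=(n_samples, latent_dim))

    # e.g.: noise = sample_latent_points(100, 128, sd=0.4)
    #       latent = generator.predict(noise)   # hypothetical trained generator
    #       smiles = decode_latent(latent)      # hypothetical decoder of the auto-encoder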

[Figure 4.2: plot titled "FTOXD sampling using different SD"; x-axis: Gaussian prior SD (0.1, 0.3, 0.5, 0.7, 1.0); y-axis: FTOXD (0–200).]

Figure 4.2: FTOXD measured for the Latent-Space-GAN on valid generated SMILES strings sampled from priors with different SD.

Furthermore, the time needed can be greatly reduced when working on a GPU. However, sampling from a prior with a different SD causes the generated distribution to be slightly different, as is also confirmed by Figure 4.2, in which the FTOXD was measured for molecules generated with the same model but sampled from Gaussian priors with different SD.


5. Conclusion

This is the first time a generative adversarial network has been used to generate molecular graphs and molecular fingerprints, and one of the first attempts to use generative Deep Learning models for drug discovery [19], [8], [50]. This master thesis opens new perspectives in chemoinformatics and drug discovery by showing the suitability of GANs for chemoinformatics-related tasks and by providing a new tool, the Chemo-GAN, for molecular fingerprint generation. The Chemo-GAN may speed up screening processes by providing a new way to obtain molecules with a high potential to be selected in successive stages of the drug discovery cycle, covering a wider chemical space that is no longer limited to human drug developers and chemists. This work also provides a new metric, the FTOXD, which considers chemically relevant features for the evaluation of generative models tackling the problems of molecular fingerprint or SMILES string generation. It was also shown that the use of an LSTM auto-encoder to map molecular graphs from and to a latent space is a fast alternative to the generation of molecular fingerprints or chemical descriptors. Furthermore, the Latent-Space-GAN was able to produce SMILES strings with low FTOXD, but the percentage of valid generated SMILES strings is low.

Further studies could be carried out to improve these performances by reducing the sources of error introduced with the use of the auto-encoder and the corrector model, for example through the use of more data and the full SMILES character alphabet. Further experiments could evaluate the effect of different GAN architectures, which have shown improvements in realistic image generation, and the use of different priors, in order to improve the percentage of valid SMILES strings. Another possible improvement could be the use of multitask learning in the discriminator to learn multiple properties at the same time. Transfer learning could also be used to allow the already-trained model to generate not only valid SMILES strings, but also ones having particular properties or activities.


Supplementary Material

Data and code used to implement and train the models are available at:

https://github.com/Isy89/GAN-in-Chemoinformatics


Acronyms

ADALINE ADAptive LInear NEuron.

ADAMET Absorption, Distribution, Metabolism, Excretion, Toxicity.

ANN Artificial Neural Network.

API Application Programming Interface.

DCGAN Deep Convolutional Generative Adversarial Network.

DNN Deep artificial Neural Network.

ECFP Extended Connectivity Fingerprints.

FID Fréchet Inception Distance.

FNN Fully connected Neural Network.

FTOXD Fréchet Tox21 Distance.

GAN Generative Adversarial Network.

KL Kullback Leibler Divergence.

LS-GAN Latent-Space-GAN.


LSTM Long Short-Term Memory.

MAE Mean Absolute Error.

MLP Multi-layer Perceptron.

MSE Mean Squared Error.

QSAR Quantitative Structure-Activity Relationship.

QSPR Quantitative Structure-Property Relationship.

RL Reinforcement Learning.

RNN Recurrent Neural Network.

SDF Structure Data Format.

SGD Stochastic Gradient Descent.

SMILES Simplified Molecular Input Line Entry Specification.

SML Supervised Machine Learning.

UML Unsupervised Machine Learning.

WLN Wiswesser Line Notation.


Bibliography

[1] Chemical hashed fingerprints. https://docs.chemaxon.com/display/docs/Chemical+Hashed+Fingerprint. Accessed: 14/11/2017.
[2] Dragon 7.0. https://chm.kode-solutions.net/products_dragon.php. Accessed: 15/09/2017.
[3] Pharmacophore fingerprints. https://docs.chemaxon.com/display/docs/Pharmacophore+Fingerprint+PF. Accessed: 14/11/2017.
[4] Martin Arjovsky and Léon Bottou. Towards principled methods for training generative adversarial networks. arXiv preprint arXiv:1701.04862, 2017.
[5] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
[6] ChEMBL version 23. ftp://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/releases/chembl_23/chembl_23_release_notes.txt. Accessed: 14/11/2017.
[7] Artem Cherkasov, Eugene N Muratov, Denis Fourches, Alexandre Varnek, Igor I Baskin, Mark Cronin, John Dearden, Paola Gramatica, Yvonne C Martin, Roberto Todeschini, et al. QSAR modeling: where have you been? Where are you going to? Journal of Medicinal Chemistry, 57(12):4977–5010, 2014.
[8] Mehdi Cherti, Balazs Kegl, and Akin Kazakci. De novo drug design with deep generative models: An empirical study, 2017.
[9] François Chollet et al. Keras. https://github.com/fchollet/keras, 2015.
[10] Wikipedia contributors. Chemical table file — Wikipedia, the free encyclopedia, 2018. [Online; accessed 2-February-2018].
[11] Arthur Dalby, James G Nourse, W Douglas Hounshell, Ann K I Gushurst, David L Grier, Burton A Leland, and John Laufer. Description of several chemical structure file formats used by computer programs developed at Molecular Design Limited. Journal of Chemical Information and Computer Sciences, 32(3):244–255, 1992.
[12] Brandon Amos Blog. https://bamos.github.io/. Accessed: 20/11/2017.
[13] Extended Connectivity Fingerprints. https://docs.chemaxon.com/display/docs/Extended+Connectivity+Fingerprint+ECFP. Accessed: 14/11/2017.
[14] Ronen Eldan and Ohad Shamir. The power of depth for feedforward neural networks. In Conference on Learning Theory, pages 907–940, 2016.
[15] Lawrence M Fisher. Marvin Minsky: 1927–2016. Communications of the ACM, 59(4):22–24, 2016.
[16] Anna Gaulton, Louisa J Bellis, A Patricia Bento, Jon Chambers, Mark Davies, Anne Hersey, Yvonne Light, Shaun McGlinchey, David Michalovich, Bissan Al-Lazikani, et al. ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Research, 40(D1):D1100–D1107, 2011.
[17] Aurélien Géron. Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems, 2017.
[18] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249–256, 2010.
[19] Rafael Gómez-Bombarelli, David Duvenaud, José Miguel Hernández-Lobato, Jorge Aguilera-Iparraguirre, Timothy D. Hirzel, Ryan P. Adams, and Alán Aspuru-Guzik. Automatic chemical design using a data-driven continuous representation of molecules.
[20] Ian Goodfellow. NIPS 2016 tutorial: Generative adversarial networks. arXiv preprint arXiv:1701.00160, 2016.
[21] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks.
[22] Louis P. Hammett. Reaction rates and indicator acidities. Chemical Reviews, 16(1):67–79, 1935.
[23] Corwin Hansch and Toshio Fujita. ρ-σ-π analysis. A method for the correlation of biological activity and chemical structure. Journal of the American Chemical Society, 86(8):1616–1626, 1964.
[24] John C Hay, F C Martin, and C W Wightman. The Mark-1 perceptron: design and performance. In Proceedings of the Institute of Radio Engineers, volume 48, pages 398–399, 1960.
[25] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, Günter Klambauer, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a Nash equilibrium. arXiv preprint arXiv:1706.08500, 2017.
[26] Geoffrey Hinton. RMSprop. http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf, 2014.
[27] Geoffrey E Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006.
[28] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[29] Kurt Hornik. Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251–257, 1991.
[30] Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P Xing. Toward controlled generation of text. In International Conference on Machine Learning, pages 1587–1596, 2017.
[31] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
[32] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. arXiv preprint arXiv:1611.07004, 2016.
[33] Vasanth Kalingeri and Srikanth Grandhe. Music generation with deep learning. arXiv preprint arXiv:1612.04928, 2016.
[34] Andrej Karpathy. The unreasonable effectiveness of recurrent neural networks. http://karpathy.github.io/2015/05/21/rnn-effectiveness/, 2017.
[35] David Kriesel. A Brief Introduction to Neural Networks. 2007.
[36] Andrey Kurenkov. A brief history of neural nets and deep learning, 2015.
[37] Andrew R Leach and Valerie J Gillet. An Introduction to Chemoinformatics. Springer Science & Business Media, 2007.
[38] Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. arXiv preprint arXiv:1609.04802, 2016.
[39] Alessandro Lusci, Gianluca Pollastri, and Pierre Baldi. Deep architectures and deep learning in chemoinformatics: the prediction of aqueous solubility for drug-like molecules. Journal of Chemical Information and Modeling, 53(7):1563–1575, 2013.
[40] Andreas Mayr, Günter Klambauer, Thomas Unterthiner, and Sepp Hochreiter. DeepTox: toxicity prediction using deep learning. Tox21 Challenge to Build Predictive Models of Nuclear Receptor and Stress Response Pathways as Mediated by Exposure to Environmental Toxicants and Drugs, page 17, 2017.
[41] Warren S McCulloch and Walter Pitts. A logical calculus of the ideas immanent in nervous activity. The Bulletin of Mathematical Biophysics, 5(4):115–133, 1943.
[42] Christopher Olah. Understanding LSTM networks. http://colah.github.io/posts/2015-08-Understanding-LSTMs/, 2017.
[43] Ennio Pannese. The Golgi stain: invention, diffusion and impact on neurosciences. Journal of the History of the Neurosciences, 8(2):132–140, 1999.
[44] Karl H Pribram. The neuropsychology of Sigmund Freud. Experimental Foundations of Clinical Psychology, pages 442–468, 1962.
[45] David Rogers and Mathew Hahn. Extended-connectivity fingerprints. Journal of Chemical Information and Modeling, 50(5):742–754, 2010.
[46] David J Rogers, Taffee T Tanimoto, et al. A computer program for classifying plants. Science, 132(3434):1115–1118, 1960.
[47] David E Rumelhart, Geoffrey E Hinton, Ronald J Williams, et al. Learning representations by back-propagating errors. Cognitive Modeling, 5(3):1, 1988.
[48] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pages 2234–2242, 2016.
[49] Ryusuke Sawada, Masaaki Kotera, and Yoshihiro Yamanishi. Benchmarking a wide range of chemical descriptors for drug-target interaction prediction using a chemogenomic approach. Molecular Informatics, 33(11-12):719–731, 2014.
[50] Marwin H. S. Segler, Thierry Kogej, Christian Tyrchan, and Mark P. Waller. Generating focussed molecule libraries for drug discovery with recurrent neural networks.
[51] Casper Kaae Sønderby, Jose Caballero, Lucas Theis, Wenzhe Shi, and Ferenc Huszár. Amortised MAP inference for image super-resolution. arXiv preprint arXiv:1610.04490, 2016.
[52] Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction, volume 1. MIT Press, Cambridge, 1998.
[53] Symex. CTfile formats. http://infochim.u-strasbg.fr/recherche/Download/Fragmentor/MDL_SDF.pdf, 2010.
[54] Roberto Todeschini and Viviana Consonni. Molecular Descriptors for Chemoinformatics, volume 41 (2 volume set). John Wiley & Sons, 2009.
[55] Alexandre Varnek. Tutorials in Chemoinformatics. John Wiley & Sons, 2017.
[56] Paul Werbos. Backwards differentiation in AD and neural nets: Past links and new opportunities. Automatic Differentiation: Applications, Theory, and Implementations, pages 15–34, 2006.
[57] Bernard Widrow et al. Adaptive "Adaline" neuron using chemical "memistors". 1960.
[58] Bernard Widrow and Michael A Lehr. 30 years of adaptive neural networks: perceptron, madaline, and backpropagation. Proceedings of the IEEE, 78(9):1415–1442, 1990.
[59] Wikipedia. Simplified molecular-input line-entry system — Wikipedia, the free encyclopedia, 2017. [Online; accessed 14-November-2017].
[60] Yuhuai Wu, Yuri Burda, Ruslan Salakhutdinov, and Roger Grosse. On the quantitative analysis of decoder-based generative models. arXiv preprint arXiv:1611.04273, 2016.
[61] Santiago Ramón y Cajal. Estructura de los centros nerviosos de las aves. 1888.
[62] Chun Wei Yap. PaDEL-descriptor: An open source software to calculate molecular descriptors and fingerprints. Journal of Computational Chemistry, 32(7):1466–1474, 2011.
[63] Rafael Yuste. From the neuron doctrine to neural networks. Nature Reviews Neuroscience, 16(8):487–497, 2015.


Curriculum Vitae Isaac Lazzeri

Graz, am 07.01.2018

PERSONAL INFORMATION

Name: Isaac Lazzeri

Address: Idlhofgasse 36/7

8020 Graz

E-mail: [email protected]

Tel.: +43 650 6726227

Date of birth: 18.12.1989

Nationality: Italian

EDUCATION AND TRAINING

Since 15/04/2015 Master’s degree program in Bioinformatics (Johannes

Kepler Universität Linz)

11/09/2016 – 16/09/2016 Summer school “Advanced School on Modelling and

Statistics for Biology, Biochemistry and Biosensing”

(Johannes Kepler Universität Linz)

24/03/2014 Bachelor’s degree in Biotechnology (Final grade: 106/110)

(Università degli studi dell’Insubria, Varese/Italy)

Thesis heading: “Topological structure of disease

associated molecular networks”

01/04/2013 – 04/10/2013

Erasmus Placement (Emergentec Biodevelopment GmbH

Vienna)

15/09/2011 – 17/07/2012 Erasmus Project at the University of Salamanca/Spain

WORK EXPERIENCE

01/03/2017 – 01/07/2017 Tutor (Machine Learning: Unsupervised Techniques, JKU)

01/10/2016 – 01/03/2017 Tutor (Machine Learning: Supervised Techniques, JKU)

11/06/2014 – 05/09/2014 Personal Assistance (Liverpool/England)

01/04/2013 – 04/10/2013 Erasmus Placement (Emergentec Biodevelopment GmbH

Vienna)

2011 Administrative activities – Università dell’Insubria

(Varese/Italy)

01/08/2008 – 02/02/2009 Administrative activities – AVIS (Associazione Volontari

Italiani Sangue, Varese/Italy)


SCHOOL EDUCATION

2003 – 2008 Liceo Artistico A. Frattini (Varese/Italy)

OTHER SKILLS

Languages: Italian: mother tongue

English: C1

Spanish: C1

German: B2

Computer Skills: Linux, Windows

Programming languages: R, Python, Perl

Technologies: Microsoft Office, Latex, Keras, Tensorflow, Pandas,

Numpy, SQL, HTML, XML, XPath, XQuery, WEKA

LANGUAGE COURSES

01/10/2016 – 28/02/2017 German immersion course B2 (JKU, Linz)

27/10/2014 – 18/12/2014 German immersion course B1 (DIG, Graz)

16/06/2014 – 22/08/2014 English immersion course C1 (LILA*, Liverpool/England)

03/2013 German immersion course (Deutsch-Akademie, Vienna)

INTERESTS

Playing the guitar, travelling, drawing, languages, sports (football, breakdance, basketball, trekking), cooking.


Eidesstattliche Erklärung (Statutory Declaration)

I hereby declare under oath that I have written this master thesis independently and without outside help, that I have not used any sources or aids other than those indicated, and that all passages taken verbatim or in substance from other sources have been marked as such. This master thesis is identical to the electronically submitted text document.

Date and signature

08.02.2018 Isaac Lazzeri

