Chemical features: how do we describe a compound to a computer?

Date post: 14-Apr-2017
Upload: richard-lewis
Page 1: Chemical features: how do we describe a compound to a computer?

Chemical Features: how do we describe a compound to a computer?

Richard Lewis

Centre for Molecular Informatics

Friday 22nd January 2016

Page 2: Chemical features: how do we describe a compound to a computer?

Introduction and synopsis

• What is a molecule?

• Techniques to represent a molecule as a data structure

• Substructure features

• Circular Fingerprints

• Conclusions

Page 3: Chemical features: how do we describe a compound to a computer?

What is a molecule?

• “A group of atoms bonded together”

• Often represented by sketches:

• This shows the atoms, how they are bonded, and gives a good idea of their spatial layout.


Page 4: Chemical features: how do we describe a compound to a computer?

What sort of features can we use?

• As an example, the compounds on the left are all ‘acidic’, whereas the ones on the right are not:

• We might deduce by eye that this is because they feature the ‘carboxylic acid’ motif:

[Figure: two groups of example compounds, labelled “Acidic” (left) and “Not acidic” (right)]

Page 5: Chemical features: how do we describe a compound to a computer?

But what did we do?

• We identified a substructure that frequently occurred in the ‘active’ class, but not in the ‘inactive’ class.

• In other words, we identified a feature, and used it to build a model in our head!

• In a more realistic problem, there may be many motifs that may affect the class of a given molecule.

• Therefore, to generalize the approach, we could build up a feature set, or long list of substructures that we expect may affect the class of a given molecule.

• In chemistry, we call substructures that tend to influence a compound’s behaviour functional groups, so we could use a list of functional groups as a feature set.

• But how can we write a program to decide if a compound contains these functional groups?

Page 6: Chemical features: how do we describe a compound to a computer?

Step 1: How to represent a molecule to a computer?

• Sketches are very intuitive for humans to work with.

• However, this pictorial representation is not very useful to a computer!

Page 7: Chemical features: how do we describe a compound to a computer?

Molecule == Graph

• It doesn’t take much imagination to decide on an appropriate data structure to represent a compound.

• Molecule => Graph

• Atom => Node

• Bond => Edge

• Note that both atoms and bonds are labelled, so the nodes and edges in the data structure must also be distinguishable.
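
As a rough illustration of the mapping above, a labelled graph can be held in two dictionaries. This is a minimal sketch in Python and is not taken from the slides: the `Molecule` class, its method names, and the choice of acetic acid with implicit hydrogens are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Molecule:
    """A molecule as a labelled, undirected graph (hypothetical helper class)."""
    atoms: dict = field(default_factory=dict)  # node id -> element symbol
    bonds: dict = field(default_factory=dict)  # frozenset({i, j}) -> bond order

    def add_atom(self, idx, element):
        self.atoms[idx] = element

    def add_bond(self, i, j, order=1):
        # A frozenset key makes the edge undirected: {i, j} == {j, i}.
        self.bonds[frozenset((i, j))] = order

    def neighbours(self, idx):
        """Return the sorted ids of the atoms bonded to atom `idx`."""
        return sorted(j for bond in self.bonds if idx in bond
                      for j in bond if j != idx)


# Acetic acid, CH3-C(=O)-OH, with hydrogens left implicit (an assumption).
acetic_acid = Molecule()
for idx, element in enumerate(["C", "C", "O", "O"]):
    acetic_acid.add_atom(idx, element)
acetic_acid.add_bond(0, 1, order=1)  # C-C
acetic_acid.add_bond(1, 2, order=2)  # C=O
acetic_acid.add_bond(1, 3, order=1)  # C-OH

print(acetic_acid.neighbours(1))  # [0, 2, 3]
```

The element symbol is the node label and the bond order is the edge label, which is what keeps atoms and bonds distinguishable in the data structure.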

Page 8: Chemical features: how do we describe a compound to a computer?

Step 2: How to find a substructure in a graph?

• Use graph theory

Page 9: Chemical features: how do we describe a compound to a computer?

Quick interlude to graph theory: Subgraphs

• H is a subgraph of G if

• the nodes of H are a subset of the nodes of G

• the edges of H are a subset of the edges of G

[Figure: two example graphs, G and H]

H is a subgraph of G
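
The two subset conditions translate almost directly into set operations. The sketch below is illustrative only: the function name `is_subgraph` and the small four-node example are assumptions, not part of the slides.

```python
def is_subgraph(h_nodes, h_edges, g_nodes, g_edges):
    """True if H's nodes and edges are subsets of G's (edges are undirected)."""
    h_edge_set = {frozenset(e) for e in h_edges}
    g_edge_set = {frozenset(e) for e in g_edges}
    return set(h_nodes) <= set(g_nodes) and h_edge_set <= g_edge_set


# G is a four-membered ring; H is a three-node path taken from it.
g_nodes = {1, 2, 3, 4}
g_edges = [(1, 2), (2, 3), (3, 4), (4, 1)]
h_nodes = {1, 2, 3}
h_edges = [(2, 1), (2, 3)]

print(is_subgraph(h_nodes, h_edges, g_nodes, g_edges))  # True
print(is_subgraph({1, 3}, [(1, 3)], g_nodes, g_edges))  # False: G has no edge 1-3
```

Note that this test only works when H already uses G's node names. Finding a substructure whose atoms are numbered differently needs the isomorphism idea that follows.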

Page 10: Chemical features: how do we describe a compound to a computer?

Quick interlude to graph theory: Isomorphism

• Two graphs, G and H, are said to be isomorphic if there exists a one-to-one mapping, f, between their nodes such that adjacency is preserved: two nodes are joined by an edge in G exactly when their images under f are joined by an edge in H.

[Figure: two isomorphic graphs, G and H. Images from Wikipedia.]

f(a) = 1, f(b) = 6, f(c) = 8, f(d) = 3, f(g) = 5, f(h) = 2, f(i) = 4, f(j) = 7
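
Checking that a given mapping preserves adjacency is straightforward; the hard part is finding one. The sketch below tests the mapping f listed above. The edge lists for G and H do not survive in this transcript, so the ones used here are an assumption about what the figure shows, and the function name `preserves_adjacency` is hypothetical.

```python
def preserves_adjacency(mapping, g_edges, h_edges):
    """Check that `mapping` is one-to-one and carries G's edges exactly onto H's.

    Assumes neither graph has isolated nodes, so comparing edge sets suffices.
    """
    if len(set(mapping.values())) != len(mapping):
        return False  # two nodes of G share an image, so not one-to-one
    h_edge_set = {frozenset(e) for e in h_edges}
    mapped = {frozenset((mapping[u], mapping[v])) for u, v in g_edges}
    return mapped == h_edge_set


# Assumed edge lists (not recoverable from the transcript itself).
g_edges = [tuple(e) for e in "ag ah ai bg bh bj cg ci cj dh di dj".split()]
h_edges = [(1, 2), (2, 3), (3, 4), (4, 1), (5, 6), (6, 7),
           (7, 8), (8, 5), (1, 5), (2, 6), (3, 7), (4, 8)]

# The mapping f from the slide.
f = {"a": 1, "b": 6, "c": 8, "d": 3, "g": 5, "h": 2, "i": 4, "j": 7}
print(preserves_adjacency(f, g_edges, h_edges))  # True

# Swapping the images of a and b breaks at least one adjacency.
bad = dict(f, a=6, b=1)
print(preserves_adjacency(bad, g_edges, h_edges))  # False
```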

Page 11: Chemical features: how do we describe a compound to a computer?

Subgraph Isomorphism

• The problem we face is the ‘subgraph isomorphism’ problem:

• Two graphs, G and H, are given as input.

• Determine whether G contains a subgraph that is isomorphic to H

• In our case, H will be the functional group, and G the molecule’s graph.

• This problem is NP-complete! Fortunately, molecular graphs are small enough that we can still solve the typical problem for our use.

• There are a number of algorithms to achieve this; VF2[1] is preferred.

• An example of a feature set used in cheminformatics is the MACCS keys*.

[1] DOI: 10.1109/TPAMI.2004.75

* although some keys are not subgraphs
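
To make the search concrete, here is a minimal sketch of a labelled substructure search using plain backtracking. It is not VF2, which adds pruning rules on top of the same basic idea, and it is not from the slides. The function names, the dictionary-based graph format, and the heavy-atom-only encoding of the carboxylic acid pattern are assumptions; with hydrogens left implicit, this pattern would also match an ester.

```python
def make_bonds(*triples):
    """Build {frozenset({i, j}): bond_order} from (i, j, order) triples."""
    return {frozenset((i, j)): order for i, j, order in triples}


def find_substructure(pattern, molecule):
    """Return one mapping {pattern atom: molecule atom}, or None if no match."""
    p_atoms, p_bonds = pattern
    m_atoms, m_bonds = molecule
    order = list(p_atoms)  # pattern atoms are matched in this fixed order

    def extend(mapping):
        if len(mapping) == len(order):
            return dict(mapping)  # every pattern atom has been placed
        p = order[len(mapping)]
        for m in m_atoms:
            if m in mapping.values() or m_atoms[m] != p_atoms[p]:
                continue  # atom already used, or element label differs
            # Every pattern bond from p to an already-mapped atom must exist
            # in the molecule with the same bond order.
            consistent = all(
                m_bonds.get(frozenset((m, mapping[q]))) == p_bonds[frozenset((p, q))]
                for q in mapping
                if frozenset((p, q)) in p_bonds
            )
            if consistent:
                mapping[p] = m
                result = extend(mapping)
                if result is not None:
                    return result
                del mapping[p]  # backtrack and try the next candidate
        return None

    return extend({})


# H: carboxylic acid motif C(=O)O. G: acetic acid and ethanol (heavy atoms only).
carboxylic_acid = ({0: "C", 1: "O", 2: "O"}, make_bonds((0, 1, 2), (0, 2, 1)))
acetic_acid = ({0: "C", 1: "C", 2: "O", 3: "O"},
               make_bonds((0, 1, 1), (1, 2, 2), (1, 3, 1)))
ethanol = ({0: "C", 1: "C", 2: "O"}, make_bonds((0, 1, 1), (1, 2, 1)))

print(find_substructure(carboxylic_acid, acetic_acid))  # {0: 1, 1: 2, 2: 3}
print(find_substructure(carboxylic_acid, ethanol))      # None
```

Running one such search per functional group and recording a 1 or 0 for each gives the substructure feature vector described earlier.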

Page 12: Chemical features: how do we describe a compound to a computer?

Problems with substructure features

• There are two main issues with this approach:

1. The algorithmic complexity.

2. The requirement of a predefined list of substructures to serve as a feature basis

• Having to choose the feature list is somewhat circular – if we knew the relevant features, we wouldn’t need to do machine learning at all!

• However, the choice of features may greatly affect results – so it is still important to choose a good set, and a generic set is unlikely to be optimal.

Page 13: Chemical features: how do we describe a compound to a computer?

Circular Fingerprints: procedurally generated features

• A solution is to procedurally generate features.

• A family of procedural feature generation algorithms, named circular fingerprints, is used very frequently in cheminformatics.

• They record ‘environments’ at a series of bond distances away from an atom, up to a maximum distance R, and hash them into a vector of length L.

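Example environments around one atom, by bond distance r:

0: C

1: C, N, N

2: C, C, C, C, N

for each atom a in molecule:
    for r = 0 to R:
        hash the environment of a at radius r to an integer (8201 in the figure)
        take the hash mod 2048 to get an index i
        set f[i] = 1

i: 0 1 2 3 4 5 6 …
f: 0 0 0 0 0 0 0 …

Repeat for other atoms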
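
The loop above can be sketched in a few lines of Python. This is a deliberately simplified illustration under stated assumptions, not ECFP/Morgan or any library's implementation: an environment here is just the sorted element labels in each shell out to radius r, bond orders are ignored, and the function name and SHA-1 based hash are choices made for the example. The default length of 2048 follows the figure.

```python
import hashlib


def circular_fingerprint(atoms, bonds, max_radius=2, length=2048):
    """Set one bit per (atom, radius) environment; returns a list of 0/1 values."""
    # Build an adjacency list from the bond list.
    neighbours = {a: set() for a in atoms}
    for i, j in bonds:
        neighbours[i].add(j)
        neighbours[j].add(i)

    fingerprint = [0] * length
    for start in atoms:
        seen = {start}
        shell = [start]   # atoms exactly r bonds away from `start`
        environment = []  # one string of sorted element labels per shell
        for r in range(max_radius + 1):
            if not shell:
                break  # the molecule has no atoms this far from `start`
            environment.append(",".join(sorted(atoms[a] for a in shell)))
            # hashlib gives the same value on every run, unlike built-in hash().
            digest = hashlib.sha1("|".join(environment).encode()).digest()
            index = int.from_bytes(digest[:4], "big") % length
            fingerprint[index] = 1
            # Move one bond further out.
            next_shell = []
            for a in shell:
                for b in neighbours[a]:
                    if b not in seen:
                        seen.add(b)
                        next_shell.append(b)
            shell = next_shell
    return fingerprint


# Acetic acid, heavy atoms only (an assumed example molecule).
atoms = {0: "C", 1: "C", 2: "O", 3: "O"}
bonds = [(0, 1), (1, 2), (1, 3)]
fp = circular_fingerprint(atoms, bonds)
print(len(fp), sum(fp))
```

Because each environment is described by sorted labels rather than atom numbers, renumbering the atoms leaves the fingerprint unchanged, and no predefined substructure list is needed.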

Page 14: Chemical features: how do we describe a compound to a computer?

Circular Fingerprints

• Benefits:

• We can transform any molecular graph to an arbitrary length binary feature vector, with arbitrarily detailed features, without needing to specify features!

• The algorithm is of lower complexity than subgraph isomorphism search.

• Issues:

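• We lose the tight coupling between a substructure and an index of the feature vector: a bit no longer corresponds to a named substructure, and different environments may hash to the same bit.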

• But we can also keep track of which atom, at which radius, gets mapped to which bit in the feature vector, so we retain knowledge of the environment each bit encodes.

Page 15: Chemical features: how do we describe a compound to a computer?

Conclusion

• Molecules may be thought of as graphs.

• This allows the application of techniques developed in the rich field of graph theory.

• We have seen two techniques for producing machine-interpretable, fixed-length vector representations of molecules.

• This paves the way for a variety of approaches:

• Similarity-based comparisons

• Use in machine learning.
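
As one example of a similarity-based comparison, binary fingerprints are commonly compared with the Tanimoto (Jaccard) coefficient: the number of bits set in both vectors divided by the number set in either. The slides do not name a specific measure, so the choice of Tanimoto, the function name, and the toy 8-bit vectors below are assumptions.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity of two equal-length binary vectors."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    both = sum(1 for a, b in zip(fp_a, fp_b) if a and b)
    either = sum(1 for a, b in zip(fp_a, fp_b) if a or b)
    # Convention chosen here: two all-zero vectors count as identical.
    return both / either if either else 1.0


fp_1 = [1, 1, 0, 0, 1, 0, 0, 0]
fp_2 = [1, 0, 0, 0, 1, 1, 0, 0]
print(tanimoto(fp_1, fp_2))  # 0.5
```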

