
Maximum Common Substructure-Based Data Fusion in Similarity Searching

Edmund Duesbury,* John Holliday,* and Peter Willett*

Information School, University of Sheffield, 211 Portobello, Sheffield S1 4DP, United Kingdom

ABSTRACT: Data fusion has been shown to work very well when applied to fingerprint-based similarity searching, yet little is known of its application to maximum common substructure (MCS)-based similarity searching. Two similarity search applications of the MCS will be focused on here. Typically, the number of bonds in the MCS, as well as the bonds in the two molecules being compared, are used in a similarity coefficient. The power of this technique can be extended using data fusion, where the MCS similarities of a set of reference molecules against one database molecule are fused. This “group fusion” technique forms the first application of the MCS in this work. The other application is that of the chemical hyperstructure. The hyperstructure concept is an alternative form of data fusion, being a hypothetical molecule that is constructed from the overlap of a set of existing molecules. This paper compares fingerprint group fusion (extended-connectivity fingerprints), MCS similarity group fusion, and hyperstructure similarity searching, and describes their relative merits and complementarity in virtual screening. It is concluded that the hyperstructure approach as implemented here is less generally effective than conventional fingerprint approaches.

Received: September 19, 2014
Published: January 20, 2015
Article, pubs.acs.org/jcim. © 2015 American Chemical Society. DOI: 10.1021/ci5005702. J. Chem. Inf. Model. 2015, 55, 222−230

■ INTRODUCTION

Similarity-based approaches to virtual screening are very widely used in lead discovery programmes in the pharmaceutical and agrochemical industries.1,2 In its simplest form, a known bioactive reference structure is matched against each of the structures in a chemical database to produce a ranking. The top-ranked molecules are those that are structurally most similar to the reference structure, using some quantitative measure of similarity, and are thus assumed to have the greatest likelihood of activity. Similarity searching is normally conducted using 2D fingerprints.3 While these have been shown to provide both an effective and an efficient way of computing molecular similarity, they are clearly a very simple type of structural representation, and there has hence been much interest in alternative similarity measures based on 1D, 2D, or 3D information of various kinds.4 One such approach is based on the encoding of molecules in chemical databases as labeled graphs, so that similarity searching can be implemented using the maximum common subgraph (MCS), which is often referred to as the maximum common substructure in chemoinformatics. Algorithms that find the MCS align the graph representing the reference structure with the graphs representing each of the database structures, finding the database molecules that have the largest substructure(s) in common with the reference structure.

If not one but several reference structures are available, then the results of searches for each of the individual structures can be combined using the methods of data fusion.5 These yield consensus rankings that often exhibit a greater degree of clustering of actives at the top of the ranking than can be obtained using a single reference structure. Data fusion can also be used to combine the rankings resulting from the use of multiple similarity measures; however, in this paper, we focus on the use of multiple reference structures, an approach that has been called group fusion.5 An alternative way of exploiting multiple reference structures in fingerprint-based similarity searching is to combine their individual fingerprints into a single consensus fingerprint.6,7 This combines the representations of the reference structures rather than the rankings resulting from their use as reference structures.

A concept parallel to the consensus fingerprint used by Shemetulskis et al.7 is that of the chemical hyperstructure where rather than fusing fingerprints into one the actual chemical graphs themselves are fused into one chemical graph. In graph theory terms, the hyperstructure is a chemical abstraction of the “supergraph” concept.8 The hyperstructure concept originates from multiple independent sources,9−11 although these will not be reviewed here.

Research on hyperstructures at Sheffield has stemmed from the work of Vladutz and Gould12 who proposed that hyperstructures be used to increase the efficiency of substructure searching. Brown et al.13,14 utilized the maximum common substructure (MCS) in hyperstructure mapping and construction, both studies utilizing genetic algorithms to find the MCS. Hyperstructures, however, were found to be unsatisfactory from a virtual screening context when used in substructural analysis, being consistently outperformed by UNITY fingerprints in retrieving active compounds.14 The reasons for the poor performance were unclear, although one proposed reason was that chemically nonmeaningful artifacts (termed “ghost substructures”) were present in the hyperstructures that resulted from the mappings of otherwise structurally different features between molecules. However, the random and nondeterministic nature of the genetic algorithms used may also have played a part in this.

In this paper, we discuss the use of multiple reference structures for graph-based similarity searching using both types of fusion: fusing the rankings resulting from searches that use the individual chemical graphs representing each of the reference structures; and fusing the individual chemical graphs into a single consensus graph, viz., a hyperstructure as discussed further below. Specifically, we report MCS-based similarity searches using both multiple reference structures and hyperstructures and compare the results with those obtained using conventional fingerprint-based group fusion.

■ MATERIALS AND METHODS

Hardware and Software. The hardware used in this study featured an Intel(R) Core(TM) i7-2600 CPU @ 3.40 GHz processor with 16 GB of DDR3 RAM clocked at 1333 MHz running Kubuntu 13.10. The Konstanz Information Miner (KNIME) 2.9.2 (refs 15, 16) running Java 1.6 was used for all experimental aspects in this study, and the Chemistry Development Kit 1.5.3 (CDK) was used for all chemoinformatics functionality unless otherwise noted.17 A 64-bit R 3.0.1 (ref 18) was used to calculate all univariate statistics reported, and the hyperstructure construction and search software was developed in Java for use with KNIME.

Data Sets. Three data sets have been used for the purposes of this study: MDDR, WOMBAT, and MUV. The MDDR and WOMBAT data sets have been used in much previous work, both at Sheffield and in the case of MDDR elsewhere. The MDDR data set contains 102,540 compounds, and WOMBAT contains 138,049 compounds. With MDDR, 11 activity classes were used as active molecules, yielding 8184 unique active molecules. Molecules not belonging to the said activity class were treated as inactive. The same processes were also applied to 14 WOMBAT activity classes, yielding 8767 unique actives. The MUV data set19 involves compounds from 17 activity classes, with 30 actives in each activity class. MUV was included as an alternative benchmark due to its design consideration of distance equivalence. The molecules in the MUV activity classes have been selected to be similar to their chosen decoy molecules, thus avoiding the analogue bias that often characterizes other data sets. Full details of these three data sets are provided by Hert et al.,6 Arif et al.,20 and Rohrer and Baumann,19 respectively.

For each activity class, 10 maximally diverse compounds were selected as a training set using the MaxMin algorithm as implemented in the KNIME version of RDKit with a randomly assigned seed.21 These 10 molecules were subsequently removed from the data set to remove self-similarity bias from the virtual screening statistics. RDKit standard Morgan fingerprints were used with a maximum radius of 3 (similar to the ECFP_6 fingerprint found in Pipeline Pilot (ref 22) folded into 1024 bits) as the fingerprint descriptor in this study.

MCS Definition and Methodology. The MCS referred to in this work is the maximum common edge-induced substructure (MCES), as opposed to the maximum common induced subgraph, which yields smaller common mappings and is less chemically intuitive.23 The MCES can be further abstracted into the connected (cMCES) and disconnected MCES (dMCES). The cMCES (Figure 1a) is the single MCES graph, where all the nodes in the subgraph are connected to at least one other node in the subgraph. The dMCES (Figure 1b) by contrast, sometimes known as the maximum overlap set (MOS), can contain multiple (separated) subgraphs, representing all the edges in common between the graphs being matched. In this study, the MCS will refer to the dMCES at all points, which we are using as it is better suited to comparing structurally dissimilar graphs (as shown in Figure 1, where bold-facing denotes common substructures).

The MCES was found using the MaxCommonSubstructure class in JChem 6.1.0, 2014 by ChemAxon.24,25 This algorithm has been claimed by its authors to be the quickest inexact method for finding the MCES between two molecules and also incorporates a number of heuristics to adjust the mappings, making the alignments more chemically feasible.26
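Returning to the training-set selection described under Data Sets, the greedy MaxMin idea can be sketched in a few lines. This is a minimal illustration only, not the RDKit/KNIME node used in the study: fingerprints are assumed to be plain sets of on-bit indices, the function names `tanimoto_distance` and `maxmin_pick` are hypothetical, and the starting compound is passed in explicitly rather than drawn from a random seed.

```python
from typing import List, Sequence, Set


def tanimoto_distance(a: Set[int], b: Set[int]) -> float:
    """1 - Tanimoto similarity for fingerprints stored as sets of on-bits."""
    union = len(a | b)
    if union == 0:
        return 0.0
    return 1.0 - len(a & b) / union


def maxmin_pick(fps: Sequence[Set[int]], n_pick: int, first: int = 0) -> List[int]:
    """Greedy MaxMin selection (illustrative sketch).

    Starting from the compound at index `first`, repeatedly add the compound
    whose distance to its nearest already-picked compound is largest.
    Returns the picked indices in the order they were chosen.
    """
    n_pick = min(n_pick, len(fps))
    if n_pick == 0:
        return []
    picked = [first]
    # min_dist[i] = distance from compound i to its closest picked compound
    min_dist = [tanimoto_distance(fp, fps[first]) for fp in fps]
    while len(picked) < n_pick:
        candidates = [i for i in range(len(fps)) if i not in picked]
        nxt = max(candidates, key=lambda i: min_dist[i])
        picked.append(nxt)
        # Update nearest-picked distances with the newly added compound
        for i, fp in enumerate(fps):
            min_dist[i] = min(min_dist[i], tanimoto_distance(fp, fps[nxt]))
    return picked


# Toy usage: four hypothetical fingerprints, pick three diverse ones
toy_fps = [{1, 2, 3}, {1, 2, 4}, {7, 8, 9}, {1, 2, 3, 4}]
diverse = maxmin_pick(toy_fps, 3)
```

In the study the picked compounds (10 per activity class) serve as reference structures and are then removed from the database being searched.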

Figure 1. MCS types between two molecules differing by a single central atom. 1a is the connected maximum common substructure, whereas 1b is the disconnected maximum common substructure.

Figure 2. Hyperstructure construction process with two molecules, shown in blue and red; bold bonds indicate those in the MCS between the two molecules. In the resulting hyperstructure, black bonds and atoms indicate the MCS, while unique bonds and atoms are colored based on the original molecules.

Hyperstructure Construction and Application. The construction process used here is based on the process described by Brown et al.;27 an example is depicted in Figure 2. The process is as follows:

(1) Select the largest molecule and remove it from the set of available molecules. This molecule is now the first hyperstructure.

(2) Select the next most similar molecule based on the number of bonds in common between the hyperstructure and this molecule.

(3) Use the MCS procedure to overlap and then to append this molecule to the hyperstructure and remove this molecule from the set. Bonds of different types may be overlapped, yielding degenerate bond types (depicted by dashed lines in the hyperstructures).

(4) Repeat steps 2 and 3 until the set is empty.
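The control flow of steps 1 to 4 can be sketched as follows. This is a deliberately simplified illustration and not the Java/KNIME implementation used in the study: each molecule is assumed to be a set of pre-aligned bond labels, so the MCS step is replaced by a plain set intersection and "appending" becomes a set union. A real implementation needs a graph-matching MCS routine and handling of degenerate bond types. The names `Bond`, `common_bonds`, and `build_hyperstructure` are hypothetical.

```python
from typing import FrozenSet, Iterable, List, Set, Tuple

# Hypothetical pre-aligned bond label, e.g. ("C1", "C2"); an assumption of this sketch
Bond = Tuple[str, str]


def common_bonds(hyper: Set[Bond], mol: FrozenSet[Bond]) -> Set[Bond]:
    """Placeholder for the MCS step: intersection of pre-aligned bond labels."""
    return hyper & mol


def build_hyperstructure(mols: Iterable[FrozenSet[Bond]]) -> Tuple[Set[Bond], List[int]]:
    """Greedy construction loop; returns the merged bond set and the merge order."""
    pool = dict(enumerate(mols))
    if not pool:
        return set(), []
    # Step 1: the largest molecule becomes the first hyperstructure
    seed = max(pool, key=lambda i: len(pool[i]))
    hyper: Set[Bond] = set(pool.pop(seed))
    order = [seed]
    # Step 4: repeat until no molecules remain
    while pool:
        # Step 2: pick the molecule sharing the most bonds with the hyperstructure
        nxt = max(pool, key=lambda i: len(common_bonds(hyper, pool[i])))
        # Step 3: overlap and append (here: union), then remove it from the set
        hyper |= pool.pop(nxt)
        order.append(nxt)
    return hyper, order


# Toy usage with three hypothetical "molecules"
toy_mols = [
    frozenset({("a", "b"), ("b", "c")}),
    frozenset({("a", "b"), ("b", "c"), ("c", "d")}),
    frozenset({("x", "y")}),
]
hyperstructure, merge_order = build_hyperstructure(toy_mols)
```

Because each newly added molecule can only enlarge the bond set, the result is always at least as large as the biggest input molecule, which is the size asymmetry that motivates the choice of similarity coefficient for hyperstructure searches.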

The Tanimoto coefficient is the standard similarity coefficient for quantifying chemical structural similarity and was used for the MCS and fingerprint scores described in the next section. However, the asymmetric Tversky coefficient was used for the hyperstructure similarity calculations. This is more suitable than the Tanimoto coefficient for similarity searching using hyperstructures due to the size differences (notably in this work, the number of bonds) because the hyperstructure will be at least as large as the biggest molecule used to build the hyperstructure. We also expect substructure similarity to play a large role in similarity searching because the hyperstructure collectively represents the scaffolds present in the input molecules. The Tversky coefficient can be biased toward substructure similarity. The Tversky coefficient is defined as

\[
S_{\mathrm{Tv}} = \frac{c}{\beta(a - c) + (1 - \beta)(b - c) + c},
\qquad
\beta = 0 \;\rightarrow\; S_{\mathrm{Tv}} = \frac{c}{b},
\qquad
\beta = 1 \;\rightarrow\; S_{\mathrm{Tv}} = \frac{c}{a}
\]

where c in this case represents the number of edges in the MCS, a is the number of bonds in the database molecule, and b the number of hyperstructure bonds. Higher values of β give a near substructure-like search, whereas lower values bias the results toward a superstructure-like search (as exemplified in Figure 3). One should note that exclusion of the terms β and (1 − β) yields the Tanimoto coefficient. In internal studies, several values of β have been tested, and the value of 0.9 emerged as the most generally suitable for virtual screening recall. This value has been found also to be beneficial in similarity searching using fingerprints28,29 and has also shown potential in scaffold hopping with fingerprints.30
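As a small worked illustration of the definition above, the sketch below evaluates the Tversky and Tanimoto coefficients directly from bond counts. The function names and the example counts are hypothetical; only the formula, the symbols a, b, c, and the value β = 0.9 come from the text.

```python
def tversky(c: int, a: int, b: int, beta: float) -> float:
    """Tversky coefficient from bond counts.

    c: bonds in the MCS, a: bonds in the database molecule,
    b: bonds in the hyperstructure, beta: weighting in [0, 1].
    """
    denom = beta * (a - c) + (1.0 - beta) * (b - c) + c
    return c / denom if denom else 0.0


def tanimoto(c: int, a: int, b: int) -> float:
    """Tanimoto coefficient: the same expression without the beta weights."""
    denom = a + b - c
    return c / denom if denom else 0.0


# Hypothetical counts: a 10-bond database molecule sharing 8 bonds
# with a 40-bond hyperstructure.
s_tv = tversky(c=8, a=10, b=40, beta=0.9)   # about 0.615
s_ta = tanimoto(c=8, a=10, b=40)            # about 0.190
```

With these counts the database molecule is almost entirely contained in the hyperstructure, which β = 0.9 rewards, whereas the Tanimoto coefficient penalizes the 32 hyperstructure bonds that the molecule does not match.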

Similarity Search Methodology. The reference sets of active molecules mentioned were subjected to one of three search methods:

• Hyperstructure construction and searching, using the Tversky coefficient with a β of 0.9 as discussed above

• MCS group fusion using the Tanimoto coefficient (with the MAX rule on similarity scores)

• Morgan fingerprint group fusion using the Tanimoto coefficient (with the MAX rule on similarity scores)

Data fusion was also implemented on the rankings obtained with fingerprints, MCS, and hyperstructures. Given a set of similarity values (or ranks) S1, S2, ... Sn, the MAX fusion rule identifies the maximum similarity value S. The SUM fusion rule by contrast is the summation of S1, S2, ... Sn. SUM fusion of the ranks was used here, as this combination rule is more appropriate when fusing different similarity measures.5 A summary of the method names (including fusion types) is given in Table 1.
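The two fusion rules can be sketched as below. This is an illustrative outline under simple assumptions, not the KNIME workflow used in the study: scores are plain lists indexed by database molecule, rank 1 is the best rank, ties are broken by list order, and all function names are hypothetical.

```python
from typing import List, Sequence


def max_fusion(scores_per_reference: Sequence[Sequence[float]]) -> List[float]:
    """MAX rule: each database molecule keeps its best score over all references."""
    return [max(column) for column in zip(*scores_per_reference)]


def ranks_from_scores(scores: Sequence[float]) -> List[int]:
    """Convert scores to ranks (1 = most similar); ties keep list order."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks


def sum_rank_fusion(rank_lists: Sequence[Sequence[int]]) -> List[int]:
    """SUM rule on ranks: a lower summed rank means a better fused position."""
    return [sum(column) for column in zip(*rank_lists)]


# Group fusion: two reference structures scored against three database molecules
fp_scores = max_fusion([[0.2, 0.9, 0.4], [0.7, 0.1, 0.5]])

# Fusing two search methods: convert each method's scores to ranks, then sum
other_scores = [0.9, 0.2, 0.8]  # hypothetical scores from a second method
fused = sum_rank_fusion([ranks_from_scores(fp_scores), ranks_from_scores(other_scores)])
```

Summing ranks rather than raw scores avoids having to put Tanimoto and Tversky values on a common scale, which is the usual motivation for rank-based fusion across different similarity measures.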

Evaluation Metrics. Two measures have been used to quantify the effectiveness of screening. The first is the enrichment factor (EF) for the top 1% of ranked compounds (shown as EF1%). EF is a simple statistic to interpret for determining recall. We appreciate however that the EF does not account for the relative rank of compounds. Two activity classes in MDDR also have a number of actives that exceeds 1% of the database (renin inhibitors and substance P antagonists have 1130 and 1246 active compounds, respectively). We have therefore chosen to use the Boltzmann-enhanced discrimination of ROC (BEDROC) score as an alternative evaluation score. BEDROC scales from 0.0 to 1.0, where a value closer to 1.0 indicates superior virtual screening performance.31
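Both measures can be computed from a ranked list of active/inactive flags. The sketch below is a hedged illustration rather than the code used in the study: the EF follows the usual definition (hit rate in the top fraction divided by the hit rate in the whole list), and the BEDROC is written in its min-max normalized form based on the robust initial enhancement (RIE), with α as the tuning parameter of the score. Function and variable names are hypothetical.

```python
import math
from typing import Sequence


def enrichment_factor(is_active: Sequence[bool], fraction: float = 0.01) -> float:
    """EF for the top `fraction` of a list ordered best-ranked first."""
    n_total = len(is_active)
    n_actives = sum(is_active)
    if n_total == 0 or n_actives == 0:
        return 0.0
    n_top = max(1, int(round(fraction * n_total)))
    hits = sum(is_active[:n_top])
    return (hits / n_top) / (n_actives / n_total)


def bedroc(is_active: Sequence[bool], alpha: float) -> float:
    """BEDROC as (RIE - RIE_min) / (RIE_max - RIE_min); list is best-ranked first."""
    n_total = len(is_active)
    n_actives = sum(is_active)
    if n_actives == 0 or n_actives == n_total:
        raise ValueError("need at least one active and one inactive")
    ratio = n_actives / n_total
    # Exponentially weighted sum over the (1-based) ranks of the actives
    sum_exp = sum(
        math.exp(-alpha * rank / n_total)
        for rank, active in enumerate(is_active, start=1)
        if active
    )
    # Expected per-active contribution for a uniformly random ranking
    random_term = (1.0 / n_total) * (1.0 - math.exp(-alpha)) / (math.exp(alpha / n_total) - 1.0)
    rie = sum_exp / (n_actives * random_term)
    rie_max = (1.0 - math.exp(-alpha * ratio)) / (ratio * (1.0 - math.exp(-alpha)))
    rie_min = (1.0 - math.exp(alpha * ratio)) / (ratio * (1.0 - math.exp(alpha)))
    return (rie - rie_min) / (rie_max - rie_min)


# Toy usage: 1000 ranked compounds with 10 actives, all placed at the top
best_case = [True] * 10 + [False] * 990
ef_best = enrichment_factor(best_case)       # (10/10) / (10/1000)
bedroc_best = bedroc(best_case, alpha=160.9)
```

A larger α concentrates the weight of the score on the earliest ranks, so the same ranked list can give quite different BEDROC values depending on the α chosen.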

A small problem with BEDROC is the need to set a tuning factor α, which determines how many of the top-ranked compounds contribute to the BEDROC score. A higher value of α means that a smaller percentage of the top-ranked compounds contributes to the majority of the BEDROC score. A value of 160.9 for α has been chosen in this study as it corresponds to EF1%, where approximately 80% of the BEDROC score is explained by the top 1% of compounds in the ranked list (refer to Table 2 of Truchon and Bayly, ref 31). It should be noted that while it has been shown that BEDROC and EF are often strongly correlated,32 BEDROC takes account of the ratio of actives to inactives, whereas EF does not. The same study notes that the two measures are uncorrelated when this ratio differs between activity classes.

Enrichment involves retrieving as many active compounds as possible, but it is often just as important in a virtual screening context to find a few structurally dissimilar compounds rather than a large set of close analogues. A method that identifies multiple different “cores” or “scaffolds” is said to be proficient at scaffold-hopping. Scaffolds in this work are represented by Bemis−Murcko frameworks with bond and atom labels removed33 but with R-groups kept when attached to linker atoms (as is the method for the RDKit definition of Murcko frameworks in KNIME). Two statistics will be presented in this study to assess the ability of a method to obtain unique scaffolds. The first is the “First-Found” scaffold enrichment factor, using the top 1% of the ranked list of molecules (represented here as EFFF1%). “First-Found” refers to the rank of the top-ranked molecule belonging to a scaffold and ignoring all subsequent molecules belonging to the scaffold. Although this measure of obtaining diversity has been criticized for several statistical flaws,34 we use it here as we are only interested in whether an active scaffold is identified or not in a ranked list. We are not interested in how the compound ranks are distributed for the given active scaffold. This represents a common goal in a virtual screening project if one is seeking new scaffolds. The other measure is the mean number of active molecules per active scaffold in the top 1% ranking (represented here as F1%). A lower value for F1% indicates that a search method retrieves on average fewer actives per scaffold and is therefore less biased toward finding analogues for a small number of scaffolds.

Figure 3. Example of how the value of β influences similarity in the Tversky coefficient. Edges in the MCS are marked in bold.

Table 1. Descriptions of Abbreviations of Search Methods Employed Here

method       description
HS           hyperstructure search applied using Tversky similarity coefficient, with a β of 0.9
FP           fingerprint Tanimoto similarity with the MAX fusion rule applied to the similarity scores
MCS          Tanimoto similarity (based on the bonds in the MCS and the two structures being compared), with the MAX fusion rule applied to the similarity scores
FP MCS       SUM fusion of FP and MCS ranks
FP HS        SUM fusion of FP and HS ranks
MCS HS       SUM fusion of MCS and HS ranks
FP MCS HS    SUM fusion of FP, MCS, and HS ranks

To get an idea of whether the search techniques retrieved different sets of molecules, we calculated a number of physicochemical descriptors. These were calculated for the active compounds in the top 1% of the ranked database resulting from each similarity search method. The descriptors used were the counts of the number of rotatable bonds, heavy atoms, heteroatoms, and rings; the length of the largest acyclic chain; and the fragment complexity. This last descriptor was the CDK implementation of the work described by Nilakantan et al.35

\[
C = \left| B^{2} - A^{2} + A \right| + \frac{H}{100}
\]

where A is the number of non-hydrogen atoms, B is the number of bonds, C is the fragment complexity, and H is the number of heteroatoms.

To assess the overlap between the ranked lists resulting from two different search methods, we calculated the number of actives common to the top 1% of the two ranked lists.
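Two of the top-1% statistics described in this section can be sketched as follows: the mean number of actives per active scaffold (F1%) and the overlap in retrieved actives between two methods. This is an illustrative outline with hypothetical names. Scaffolds are assumed to be given as precomputed identifiers (for example, canonical framework strings), and the overlap is normalized by the column method's count, an assumption taken from the footnote to Table 4.

```python
from collections import Counter
from typing import Hashable, Sequence, Set


def mean_actives_per_scaffold(top_active_scaffolds: Sequence[Hashable]) -> float:
    """F-style statistic: actives in the top slice per distinct active scaffold.

    `top_active_scaffolds` holds one scaffold identifier per active compound
    found in the top 1% of a ranked list.
    """
    counts = Counter(top_active_scaffolds)
    if not counts:
        return 0.0
    return sum(counts.values()) / len(counts)


def overlap_fraction(row_actives: Set[str], column_actives: Set[str]) -> float:
    """Shared actives divided by the actives retrieved by the column method."""
    if not column_actives:
        return 0.0
    return len(row_actives & column_actives) / len(column_actives)


# Toy usage with hypothetical identifiers
f_value = mean_actives_per_scaffold(["s1", "s1", "s2", "s3", "s3", "s3"])
method_a = {"m1", "m2", "m3"}
method_b = {"m2", "m3", "m4", "m5"}
a_vs_b = overlap_fraction(method_a, method_b)
b_vs_a = overlap_fraction(method_b, method_a)
```

Because the denominator is the column method's count, the overlap is asymmetric: the two orderings of the same pair of methods generally give different values.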

■ RESULTS AND DISCUSSIONFigure 4 summarizes recall performances for the similaritysearch methods tested, and Figure 5 summarizes scaffold-retrieval, averaged over the different activity classes in each data

Figure 4. Bar charts showing mean recall statistics for the data sets. Dark-colored bars indicate that the method has a mean significantly differentfrom that of FP (p ≤ 0.05), as determined by a paired 2-tailed Wilcoxon signed-rank test. Error bars represent one standard deviation above andbelow the mean.

Journal of Chemical Information and Modeling Article

DOI: 10.1021/ci5005702J. Chem. Inf. Model. 2015, 55, 222−230

225

Page 5: Maximum Common Substructure-Based Data Fusion in Similarity …€¦ · applications of the MCS will be focused on here. Typically, the number of bonds in the MCS, as well as the

set. The immediate conclusion from the statistics tables is, oncomparing the mean values of BEDROC and EF1%, that FP issignificantly superior to HS, with MCS lying between the two.This suggests that fingerprints are better suited to virtualscreening, the p-values for the Wilcoxon signed-rank tests beingsignificant. In all three data sets the fingerprints significantly(and consistently) outperformed hyperstructures in virtualscreening ability. The observation that fingerprints generallymatch or outperform MCS-based methods is consistent withthe related work of Raymond and Willett,36 although in thiswork, the fingerprints used are of a better standard in terms ofvirtual screening recall than those used by Raymond andWillett.36 A further conclusion from the figures is, as has beennoted by previous studies, that the MUV data set presents a

much harder challenge for ligand-based virtual screeningmethods such as those presented here because the performanceis inferior compared to the other two data sets.The data fusion techniques show improved BEDROC scores

over the results for HS and MCS, although none of them aresuperior to FP alone. There exists only one technique that FPdoes not consistently outperform, this being FP MCS, althoughthe time requirements (see next section) to calculate the MCSin this technique make FP preferable.FP interestingly outperforms all the other techniques in

terms of EFFF1% as for BEDROC, implying that fingerprintgroup fusion is good for scaffold hopping. Effective scaffoldhopping has been observed with single reference structures,37,38

thus, it is unsurprising that group fusion of fingerprints also

Figure 5. Bar charts showing mean scaffold-retrieving statistics for the data sets. Dark-colored bars indicate that the method has a mean significantlydifferent from that of FP (p ≤ 0.05) as determined by a paired 2-tailed Wilcoxon signed-rank test. Error bars represent one standard deviation aboveand below the mean.

Journal of Chemical Information and Modeling Article

DOI: 10.1021/ci5005702J. Chem. Inf. Model. 2015, 55, 222−230

226

Page 6: Maximum Common Substructure-Based Data Fusion in Similarity …€¦ · applications of the MCS will be focused on here. Typically, the number of bonds in the MCS, as well as the

yields a favorable scaffold hopping potential. Although HSgenerally retrieves a lower number of unique active scaffolds, itcan be seen that HS almost always obtains a significantly lowerF1% than the FP and MCS methods. From this, it can beinferred that the top-ranked active molecules retrieved byhyperstructures have less analogues per scaffold than thoseretrieved with fingerprints (in relation to the number of activesretrieved). MCS by comparison has no significant difference inscaffold retrieval compared to fingerprints and also has aninferior virtual screening performance to fingerprints. The datafusion techniques, from their BEDROC and EF1% values, showcompromises in diversity retrieval. Of note, FP HS and MCSHS for all three data sets have significantly lower F1% valuesthan FP.Table 2 shows the performance times of the hyperstructure

and MCS searches. The typical search time requirement for

hyperstructures has a mean of 82.6 times greater than thanfingerprint searches, but the hyperstructure searches areconsistently faster than MCS. The fraction of time requiredfor hyperstructure searches compared to MCS is between 0.49and 0.8 with a mean of 0.638, depending on the data set andactivity class.The physicochemical properties shown in Table 3 highlight

some statistically significant differences in the moleculesbetween hyperstructure, MCS, and fingerprint searches. Ofimmediate note is that hyperstructures retrieve significantlylarger (more atoms) active molecules than fingerprints. Inaddition to being larger, the molecules are more complex andpossess larger acyclic chains but have no significant difference inheteroatom and ring count. This implies that the molecules areless feature-rich and more chain-rich. By contrast, the MCS-

retrieved active molecules are smaller and less chain-rich,although also possessing significantly less heteroatoms.In the MDDR data set, the renin−angiotensin class is the

most intrinsically similar class, whereas cyclooxygenaseinhibitors are the most diverse. This is on the basis of themean pairwise similarities between all pairs of active moleculesfor said activity classes using Unity 2D fingerprints and theTanimoto coefficient.6 Table 4 shows a strong lack of overlap in

the retrieved actives of the techniques presented, both for thetwo activity classes as well as with the mean values. Of note isthat FP and MCS have a greater overlap with each other thanwith the hyperstructure searches. These observations are morepronounced with the cyclooxygenase inhibitors than with renin,although even with the former there is still little overlap.One of the potential attractions of the HS approach is that it

may provide a way of identifying novel scaffolds. The top five-ranked actives involved in the MDDR COX inhibitors searchare shown in Figure 6b, with the reference structures fromwhich the hyperstructure was constructed being in Figure 6a.Comparisons of these two will reveal that the search hasidentified three scaffolds not present in the reference structures(actives 56, 346, and 485). Of note, the top five compoundsretrieved by FP (Figure 6c) all share the same scaffold (with r9)and generally differ from each other by just one atom. Thesimilarities are also evident from Table 5, where the FPsimilarities for the FP-retrieved actives are much higher thanthose retrieved by HS. This ability to prioritise analogues overnonanalogues is a major reason why the fingerprint-MAX ruleoutperforms HS (and it is unsurprising this works so well given

Table 2. Time Statistics on Constructed Hyperstructures forEach Class in the MDDR Data Seta

target ID FPS HSC HSS MCSS

6233 1.5 × 104 1.5 × 103 9.1 × 105 1.8 × 106

6235 5.9 × 104 1.0 × 103 1.4 × 106 2.3 × 106

6245 1.4 × 104 6.3 × 102 1.1 × 106 2.0 × 106

7701 1.5 × 104 9.0 × 102 1.5 × 106 2.1 × 106

31420 3.4 × 104 1.6 × 103 2.4 × 106 4.9 × 106

31432 1.9 × 104 1.6 × 103 2.1 × 106 3.4 × 106

37110 1.7 × 104 1.0 × 103 1.9 × 106 2.9 × 106

42731 4.4 × 104 2.3 × 103 3.2 × 106 4.0 × 106

71523 2.4 × 104 2.3 × 103 3.0 × 106 4.0 × 106

78331 2.0 × 104 1.1 × 103 1.3 × 106 2.2 × 106

78374 2.1 × 104 1.0 × 103 2.0 × 106 2.6 × 106

aHSC is the hyperstructure construction time. FPS, HSS, and MCSS aretimes for fingerprint, hyperstructure, and MCS searching, respectively.All times in this table are in milliseconds.

Table 3. Mean Values of Physicochemical Properties of Actives of Top 1% Ranked Lists for Given Methods^a

method   complexity   largest chain   rotatable bonds   heavy atoms   heteroatoms   rings
FP       3570         15.022          11.511            32.253        7.714         3.874
MCS      3146         13.188          10.554            29.995        6.793         3.655
HS       3969         16.469          12.486            33.913        7.632         3.873
p_MCS    0.002        0.005           0.054             0.005         0.005         0.175
p_HS     0.010        0.042           0.083             0.032         1.000         0.765

^a Values with the prefix “p_” reflect the p-values from paired 2-tailed Wilcoxon signed rank tests, tested against FP.

Table 4. Complementarity of Search Methods^a

(a) renin−angiotensin inhibitors

method   MCS     FP      HS
MCS      1.000   0.312   0.352
FP       0.609   1.000   0.583
HS       0.293   0.248   1.000

(b) cyclooxygenase inhibitors

method   MCS     FP      HS
MCS      1.000   0.329   0.105
FP       0.392   1.000   0.105
HS       0.016   0.013   1.000

(c) mean values across all classes ± one standard deviation

method   MCS             FP              HS
MCS      1.000 ± 0.000   0.445 ± 0.122   0.153 ± 0.085
FP       0.268 ± 0.074   1.000 ± 0.000   0.123 ± 0.074
HS       0.230 ± 0.087   0.317 ± 0.163   1.000 ± 0.000

^a Each cell gives the proportion of identified active compounds in common with the two methods divided by the number of actives identified by the method in the column.
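The complementarity measure described in the Table 4 footnote can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `complementarity` and the compound identifiers in `hits` are hypothetical, and each method's retrieved actives are assumed to be available as a set.

```python
def complementarity(actives_by_method):
    """Return {row: {col: |row & col| / |col|}} for every pair of methods.

    The denominator is the number of actives found by the column method,
    so the matrix is generally asymmetric, as in Table 4.
    """
    matrix = {}
    for row, row_hits in actives_by_method.items():
        matrix[row] = {}
        for col, col_hits in actives_by_method.items():
            shared = len(row_hits & col_hits)
            matrix[row][col] = shared / len(col_hits) if col_hits else 0.0
    return matrix


# Hypothetical sets of active compound IDs retrieved by each method.
hits = {
    "MCS": {1, 2, 3, 4},
    "FP": {3, 4, 5, 6, 7, 8},
    "HS": {4, 9},
}
table = complementarity(hits)
for method, row in table.items():
    print(method, {col: round(value, 3) for col, value in row.items()})
```

Because the normalisation uses only the column method's hit count, a method that retrieves few actives can show a high value in its own column even when the absolute number of shared compounds is small.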

Journal of Chemical Information and Modeling Article

DOI: 10.1021/ci5005702
J. Chem. Inf. Model. 2015, 55, 222−230


that 10 molecules with different scaffolds are used as the reference set).

■ CONCLUSIONS

The results of this investigation showed that both hyperstructures and MCS fusion, at least for the data sets used, are inferior to a conventional fingerprinting method for performing virtual screening in terms of enrichment and scaffold retrieval. MCS group fusion is significantly slower than hyperstructure-based similarity searching, albeit with a slight gain in virtual screening performance. Although hyperstructures retrieve fewer scaffolds than fingerprints, they retrieve a better spread of compounds across scaffolds compared to all the other techniques tested, implying that they are less likely to find analogues than MCS and fingerprint group fusion techniques. MCS and, in particular, hyperstructure searches have low overlaps with fingerprints in the active compounds retrieved.

The physicochemical properties of the actives retrieved often differ between the three techniques as well. Compared to fingerprints, hyperstructures tend to retrieve larger molecules with greater chains and fewer heteroatoms, whereas the opposite is observed with MCS searches. Data fusion of the techniques used in this study generally yields compromises in virtual screening performance. All fusion techniques here outperformed MCS and hyperstructure-based searches alone but failed to outperform fingerprint searching.

The results above demonstrate clearly that fingerprints outperform hyperstructure searches in terms of numbers of retrieved actives. One obvious reason for this behavior is the fingerprints’ ability to identify large numbers of close analogues to an entire reference structure, something that is much more difficult for a hyperstructure that has been constructed from multiple individual molecules, especially when these are structurally disparate (as is the case here). It should also be

Figure 6. Compounds involved in the MDDR COX inhibitors search. Panel (a) shows the molecules used to construct the hyperstructure (numbered arbitrarily). Panel (b) shows the top five-ranked active compounds retrieved by the hyperstructure, with their ranks displayed. Panel (c) shows the top five-ranked active compounds retrieved by FP, with their ranks displayed.


noted that the baseline of comparison is a type of fingerprint that is known to be extremely effective for virtual screening. Finally, the MCS algorithm used in this work is inexact and also generates only a single MCS, even if several different (and potentially more chemically feasible) ones are possible. These factors influence the performance of the hyperstructure concept, both in hyperstructure construction and similarity searching. It would thus be worth investigating potential performance changes using alternative MCS algorithms.

■ AUTHOR INFORMATION

Corresponding Authors
*E-mail: [email protected].
*E-mail: [email protected].
*E-mail: [email protected]

The authors declare no competing financial interest.

■ ACKNOWLEDGMENTS

The authors thank Peter Englert for information on the features of the JChem MCS algorithm, John May for usage advice on the Chemistry Development Kit, and the University of Sheffield for the Faculty Scholarship funding and hardware used in this Ph.D. project.

■ REFERENCES

(1) Maggiora, G.; Vogt, M.; Stumpfe, D.; Bajorath, J. Molecular Similarity in Medicinal Chemistry. J. Med. Chem. 2014, 57, 3186−3204.
(2) Willett, P. Similarity Methods in Chemoinformatics. Annu. Rev. Inform. Sci. 2009, 43, 1−117.
(3) Willett, P. Similarity-based Virtual Screening Using 2D Fingerprints. Drug Discovery Today 2006, 11, 1046−1053.
(4) Todeschini, R.; Consonni, V. Molecular Descriptors for Chemoinformatics, 2nd ed.; Wiley-VCH: Weinheim, 2009.
(5) Willett, P. Combination of Similarity Rankings Using Data Fusion. J. Chem. Inf. Model. 2013, 53, 1−10.
(6) Hert, J.; Willett, P.; Wilton, D. J.; Acklin, P.; Azzaoui, K.; Jacoby, E.; Schuffenhauer, A. Comparison of Topological Descriptors for Similarity-based Virtual Screening Using Multiple Bioactive Reference Structures. Org. Biomol. Chem. 2004, 2, 3256.
(7) Shemetulskis, N. E.; Weininger, D.; Blankley, C. J.; Yang, J. J.; Humblet, C. Stigmata: An Algorithm To Determine Structural Commonalities in Diverse Datasets. J. Chem. Inf. Comput. Sci. 1996, 36, 862−871.
(8) Bunke, H.; Jiang, X.; Kandel, A. On the Minimum Common Supergraph of Two Graphs. Computing 2000, 65, 13−25.
(9) Dubois, J. E.; Laurent, D.; Aranda, A. Methode de Perturbation D’Environnements Limites Concentriques Ordonnes (PELCO). J. Chim. Phys. 1973, 70, 1608−1615.
(10) Menon, G. K.; Cammarata, A. Pattern Recognition II: Investigation of Structure−Activity Relationships. J. Pharm. Sci. 1977, 66, 304−314.
(11) Simon, Z.; Badilescu, I.; Racovitan, T. Mapping of Dihydrofolate-Reductase Receptor Site by Correlation with Minimal Topological (steric) Differences. J. Theor. Biol. 1977, 66, 485−495.
(12) Vladutz, G.; Gould, S. R. Chemical Structures: The International Language of Chemistry; Springer Verlag: Berlin, 1988; Vol. 1, pp 371−384.
(13) Brown, R. D.; Downs, G. M.; Willett, P.; Cook, A. P. F. Hyperstructure Model for Chemical Structure Handling: Generation and Atom-by-Atom Searching of Hyperstructures. J. Chem. Inf. Comput. Sci. 1992, 32, 522−531.
(14) Brown, N. Generation and Application of Activity-weighted Chemical Hyperstructures. Ph.D. Thesis, University of Sheffield, Sheffield, 2002.
(15) Berthold, M. R.; Cebron, N.; Dill, F.; Gabriel, T. R.; Kotter, T.; Meinl, T.; Ohl, P.; Sieb, C.; Thiel, K.; Wiswedel, B. In Data Analysis, Machine Learning and Applications; Preisach, C., Burkhardt, P. D. H., Schmidt-Thieme, P. D. L., Decker, P. D. R., Eds.; Studies in Classification, Data Analysis, and Knowledge Organization; Springer: Berlin Heidelberg, 2008; pp 319−326.
(16) Berthold, M. R.; Cebron, N.; Dill, F.; Gabriel, T. R.; Kotter, T.; Meinl, T.; Ohl, P.; Thiel, K.; Wiswedel, B. KNIME - The Konstanz Information Miner, Version 2.0 and Beyond. SIGKDD Explor. 2009, 11, 26−31.
(17) Steinbeck, C.; Han, Y.; Kuhn, S.; Horlacher, O.; Luttmann, E.; Willighagen, E. The Chemistry Development Kit (CDK): An Open-Source Java Library for Chemo- and Bioinformatics. J. Chem. Inf. Comput. Sci. 2003, 43, 493−500.
(18) R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2008.
(19) Rohrer, S. G.; Baumann, K. Maximum Unbiased Validation (MUV) Data Sets for Virtual Screening Based on PubChem Bioactivity Data. J. Chem. Inf. Model. 2009, 49, 169−184.
(20) Arif, S. M.; Holliday, J. D.; Willett, P. Analysis and Use of Fragment-occurrence Data in Similarity-based Virtual Screening. J. Comput. Aided Mol. Des. 2009, 23, 655−668.
(21) Ashton, M.; Barnard, J.; Casset, F.; Charlton, M.; Downs, G.; Gorse, D.; Holliday, J.; Lahana, R.; Willett, P. Identification of Diverse Database Subsets Using Property-Based and Fragment-Based Molecular Descriptions. QSAR 2002, 21, 598−604.
(22) Rogers, D.; Hahn, M. Extended-connectivity Fingerprints. J. Chem. Inf. Model. 2010, 50, 742−754.
(23) Raymond, J. W.; Willett, P. Maximum Common Subgraph Isomorphism Algorithms for the Matching of Chemical Structures. J. Comput. Aided Mol. Des. 2002, 16, 521−533.
(24) Kovacs, P.; Englert, P. MaxCommonSubstructure (JChem API Documentation (c) 1998−2013 ChemAxon, Ltd.), 2013. http://www.chemaxon.com/jchem/doc/dev/java/api/com/chemaxon/search/mcs/MaxCommonSubstructure.html (accessed January 2015).
(25) Kovacs, P.; Englert, P. Making the Most of Approximate Maximum Common Substructure Search, 2014. http://www.slideshare.net/penglert/mcs-poster (accessed January 2015).
(26) Englert, P.; Kovacs, P. Making the Most of Approximate Maximum Common Substructure Search. J. Cheminform. 2014, 6, P29.
(27) Brown, R. D.; Downs, G. M.; Jones, G.; Willett, P. Hyperstructure Model for Chemical Structure Handling: Techniques for Substructure Searching. J. Chem. Inf. Comput. Sci. 1994, 34, 47−53.
(28) Wang, Y.; Eckert, H.; Bajorath, J. Apparent Asymmetry in Fingerprint Similarity Searching Is a Direct Consequence of

Table 5. Similarities of Molecules in Figure 6^a

(a) retrieved actives with HS (Figure 6b)

molecule   similarity   ref
56         0.260        r7
346        0.228        r2
353        0.570        r1
484        0.685        r9
485        0.126        r2

(b) retrieved actives with FP (Figure 6c)

molecule   similarity   ref
2          0.769        r9
3          0.756        r9
4          0.732        r9
5          0.732        r9
6          0.718        r9

^a The similarity reported uses the Tanimoto coefficient with the MAX fusion rule to the reference molecules, the reference molecule being quoted. The fingerprint used is as with the FP method.
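The similarity reported in Table 5 combines the Tanimoto coefficient with the MAX fusion rule: a database molecule is scored against every reference molecule and the highest score, together with the reference that produced it, is kept. The following is a minimal sketch of that calculation. It assumes fingerprints are represented as sets of on-bit indices; the functions `tanimoto` and `max_fusion` and the toy bit sets are hypothetical and are not the ECFP implementation used in the study.

```python
def tanimoto(a, b):
    """Tanimoto coefficient between two fingerprints held as sets of on-bits."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    common = len(a & b)
    return common / (len(a) + len(b) - common)


def max_fusion(query, references):
    """Return (best similarity, name of best-matching reference) under MAX."""
    best = max(references, key=lambda name: tanimoto(query, references[name]))
    return tanimoto(query, references[best]), best


# Toy reference fingerprints (hypothetical bit indices, labelled as in Figure 6a).
refs = {
    "r1": {1, 2, 3, 4},
    "r2": {3, 4, 5, 6, 7},
    "r9": {8, 9},
}
query = {3, 4, 5, 6}
score, ref = max_fusion(query, refs)
print(f"MAX-fused similarity {score:.3f} to reference {ref}")
```

Under the MAX rule a single close analogue of any one reference is enough to rank a molecule highly, which is consistent with the FP-retrieved actives in Table 5 all pointing to the same reference (r9).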


Differences in Bit Densities and Molecular Size. ChemMedChem 2007, 2, 1037−1042.
(29) Horvath, D.; Marcou, G.; Varnek, A. Do Not Hesitate To Use Tversky - And Other Hints for Successful Active Analogue Searches with Feature Count Descriptors. J. Chem. Inf. Model. 2013, 53, 1543−1562.
(30) Senger, S. Using Tversky Similarity Searches for Core Hopping: Finding the Needles in the Haystack. J. Chem. Inf. Model. 2009, 49, 1514−1524.
(31) Truchon, J.-F.; Bayly, C. I. Evaluating Virtual Screening Methods: Good and Bad Metrics for the “Early Recognition” Problem. J. Chem. Inf. Model. 2007, 47, 488−508.
(32) Riniker, S.; Landrum, G. A. Open-source Platform to Benchmark Fingerprints for Ligand-Based Virtual Screening. J. Cheminform. 2013, 5, 26.
(33) Bemis, G. W.; Murcko, M. A. The Properties of Known Drugs. 1. Molecular Frameworks. J. Med. Chem. 1996, 39, 2887−2893.
(34) Mackey, M. D.; Melville, J. L. Better than Random? The Chemotype Enrichment Problem. J. Chem. Inf. Model. 2009, 49, 1154−1162.
(35) Nilakantan, R.; Nunn, D. S.; Greenblatt, L.; Walker, G.; Haraki, K.; Mobilio, D. A Family of Ring System-Based Structural Fragments for Use in Structure−Activity Studies: Database Mining and Recursive Partitioning. J. Chem. Inf. Model. 2006, 46, 1069−1077.
(36) Raymond, J. W.; Willett, P. Effectiveness of Graph-Based and Fingerprint-Based Similarity Measures for Virtual Screening of 2D Chemical Structure Databases. J. Comput.-Aided Mol. Des. 2002, 16, 59−71.
(37) Gardiner, E. J.; Holliday, J. D.; O’Dowd, C.; Willett, P. Effectiveness of 2D Fingerprints for Scaffold Hopping. Future Med. Chem. 2011, 3, 405−414.
(38) Vogt, M.; Stumpfe, D.; Geppert, H.; Bajorath, J. Scaffold Hopping Using Two-Dimensional Fingerprints: True Potential, Black Magic, or a Hopeless Endeavor? Guidelines for Virtual Screening. J. Med. Chem. 2010, 53, 5707−5715.
