
Predicting the Genotoxicity of Thiophene Derivatives from Molecular Structure


Philip D. Mosier,† Peter C. Jurs,*,† Laura L. Custer,‡ Stephen K. Durham,‡ and Greg M. Pearl‡

Department of Chemistry, The Pennsylvania State University, University Park, Pennsylvania 16802, and Bristol-Myers Squibb Company, Princeton, New Jersey 08543

Received November 13, 2002

We report several binary classification models that directly link the genetic toxicity of a series of 140 thiophene derivatives with information derived from the compounds’ molecular structure. Genetic toxicity was measured using an SOS Chromotest. IMAX (maximal SOS induction factor) values were recorded for each of the 140 compounds both in the presence and in the absence of S9 rat liver homogenate. Compounds were classified as genotoxic if IMAX ≥ 1.5 in either test or nongenotoxic if IMAX < 1.5 for both tests. The molecular structures were represented by numerical descriptors that encoded the topological, geometric, electronic, and polar surface area properties of the thiophene derivatives. The classification models used were linear discriminant analysis (LDA), k-nearest neighbor classification (k-NN), and the probabilistic neural network (PNN). These were used in conjunction with either a genetic algorithm or generalized simulated annealing to find optimal subsets of descriptors for each classifier. The quality of the resulting models was determined by the number of misclassified compounds, with preference given to models that produced fewer false negative classifications. Model sizes ranged from seven descriptors for LDA to three descriptors for k-NN and PNN. Very good classification results were obtained with all three classifiers. Classification rates for the LDA, k-NN, and PNN models were 80, 85, and 85%, respectively, for the prediction set compounds. Additionally, a consensus model was generated that incorporated all three of the basic model types. This consensus model correctly predicted the genotoxicity of 95% of the prediction set compounds.

Introduction

Modern drug discovery efforts have been focusing upon utilization of high throughput screening coupled with the concurrent assessment of a compound’s advantages and liabilities at early stages in the development process. The use of predictive in silico models is at the forefront of several strategies for identifying liabilities early in the drug discovery process (1-9). Particularly useful is predicting toxicological liabilities, such as carcinogenicity, mutagenicity, hepatotoxicity, and teratogenicity, because in vivo and in vitro toxicity testing requires a substantial investment of both time and money. The current paradigm for predictive in silico toxicity assessment is to utilize toxic biophores or associated chemical structures (10-13) to predict general toxic liabilities, such as carcinogenicity. Depending upon the methods used for generating these toxic biophores, the subsequent alerts are generally either too sensitive or too specific to have a significant impact upon the drug discovery process. However, previous studies using various pattern recognition techniques involving the modeling of carcinogenicity (14-17), acute toxicity (18-21), and genotoxicity (22) have proven successful.

To develop a predictive toxicity model, one should select a data set that is limited to a predominating toxic mechanism. To accomplish this, we have chosen to model the genotoxicity liability of thiophenes. Thiophenes were selected due to their presence in pharmaceuticals and their potential, via CYP450 metabolism, to form reactive metabolites such as epoxides and sulfoxides (23-26) as illustrated in Figure 1. Epoxides have been shown to be genotoxic (27-29). The mechanism of genotoxicity is believed to result from DNA alkylation by a reactive ion following ring opening. Subsequent cell replication converts alkylated DNA into permanent gene mutations.

An appropriate data set should be used for the model generation that adequately represents chemical space. The neighborhood around the hypothesized toxic biophore should be represented, while excluding compounds with known biophores associated with the toxicity under investigation. These data set limitations have typically made it impossible to develop local predictive toxicity models because there is insufficient data available in the published literature. Assuming that sufficient data existed, serious questions remain regarding the combination of data produced under divergent protocols in multiple laboratories and methods used to assign compounds to either training or predictive data sets. To alleviate these modeling issues, a thiophene data set was created from commercially available compounds. These compounds were tested in the SOS Chromotest to assess potential genotoxicity.

† The Pennsylvania State University.
‡ Bristol-Myers Squibb Company.

Figure 1. Proposed thiophene metabolism and genotoxic mechanism of action. Thiophene is oxidized enzymatically by members of the cytochrome P-450 (CYP 450) multifunction oxidases, resulting in the formation of thiophene epoxides and sulfoxides. These species are susceptible to nucleophilic attack by DNA nucleobases to form DNA adducts. When the damaged DNA is replicated, mutations occur in the newly formed DNA molecules as a result of the damaged parent DNA strand.

Chem. Res. Toxicol. 2003, 16, 721-732. 10.1021/tx020104i CCC: $25.00 © 2003 American Chemical Society. Published on Web 05/07/2003

Experimental Procedures

The process used to develop the models described here consisted of five stages: (i) selection and testing of compounds, (ii) structure entry and optimization, (iii) descriptor generation, (iv) objective feature selection, and (v) subjective feature selection. A general description of each of these steps is given here.

Selection and Testing of Compounds. The compounds used for the model development were selected based upon (i) availability from Aldrich Chemical Co. and (ii) passing a Lipinski drug likeness filter (30) in which the original “rule of five” ranges were expanded in order to classify more compounds as “druglike”. This was required in order to obtain a sufficient number of compounds with which to build models of genotoxicity. The purest form of the compounds was selected for running in the assay, and most of the compounds obtained had >98% purity. The data set is comprised of compounds selected based upon chemical diversity and supplemented by manual addition of chemically interesting groups. Chemical diversity was achieved by clustering the electrotopological indices, molecular weight, and various atom and ring counts. Compound selection was then limited to three compounds from a specific class. The compounds chosen from the clusters were limited to five example compounds per cluster where priority was given to compounds that did not contain genotoxic alerts, as predicted by DEREK 3.6 (10). DEREK (Deductive Estimation of Risk from Existing Knowledge) is a rule-based expert system that identifies chemical substructures that are responsible for various types of toxicity. The data set was also enriched by adding simple chemical substitutions; an example of these additions would be to include examining the effect of methyl- vs tert-butyl-substituted thiophenes or effects of halogen substitution. In all, 140 compounds were used to build the classification models described here (Table 1).

The SOS Chromotest was used to determine the genotoxic liability for each of the compounds in the thiophene data set. The assay measures induction of a lacZ reporter gene in response to DNA damage (31). The SOS pathway plays a leading role in the way Escherichia coli respond to genotoxic damage (32). Because this pathway responds to a broad spectrum of genotoxic substances, SOS induction can be used as an early monitor for DNA damage. E. coli were modified with a lacZ reporter gene under transcriptional control of an SOS repair gene. Normally, SOS repair genes are repressed, but in response to DNA damage, these genes are induced resulting in production of β-galactosidase, the gene product of lacZ. Fold increases in gene induction are determined by measuring β-galactosidase activity using o-nitrophenyl-β-D-galactopyranoside. The assay has been used extensively with many different chemical classes. A review of published data between 1982 and 1992 demonstrated that for 1776 compounds the SOS Chromotest had 90% concordance with the Ames mutagenicity test (33). IMAX (maximal SOS induction factor) values were measured for each of the 140 thiophene derivatives with and without S9 (rat liver homogenate) metabolic transformation. For the S9-activated assay, IMAX values ranged from 0.90 to 6.08 with a mean of 1.28 and a median value of 1.08. In the assay lacking S9, IMAX values ranged from 0.88 to 8.01 with a mean of 1.35 and a median value of 1.19. In each case, the minimum IMAX value was recorded for 2-thiophenecarboxylic acid and the maximum value was recorded for 2-nitrothiophene. Compounds were classified as genotoxic if IMAX ≥ 1.5 in either test or nongenotoxic if IMAX < 1.5 for both tests. This resulted in 39 of the 140 compounds (28%) being defined as genotoxic.
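The classification rule above can be written as a short function. This is only an illustrative sketch: the function name and argument names are hypothetical, and the two example calls use the extreme IMAX values quoted in the text.

```python
IMAX_THRESHOLD = 1.5  # cutoff stated in the text


def classify_genotoxicity(imax_with_s9: float, imax_without_s9: float) -> str:
    """Label a compound from its two IMAX values (hypothetical helper).

    A compound is genotoxic if IMAX >= 1.5 in either assay and
    nongenotoxic only if IMAX < 1.5 in both assays.
    """
    if imax_with_s9 >= IMAX_THRESHOLD or imax_without_s9 >= IMAX_THRESHOLD:
        return "genotoxic"
    return "nongenotoxic"


# Maximum values (2-nitrothiophene) and minimum values
# (2-thiophenecarboxylic acid) reported for the two assays.
print(classify_genotoxicity(6.08, 8.01))
print(classify_genotoxicity(0.90, 0.88))
```

Because the rule is an "either test" criterion, a single positive assay is enough to assign the genotoxic class.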

Structure Entry and Optimization. In the structure entry and optimization stage, the data relating to each of the selected molecules’ atom types, bond types, and bond connection tables were saved as MACCS molecule (.mol) files using the DIVA 2.0 (Accelrys, San Diego, CA) computer program. Each molecule was assigned a three-dimensional conformation using the CONCORD software of Pearlman (34). The 140 compounds used are listed in Table 1. The molecules were then transferred to a UNIX workstation where the MOPAC 6.0 (35) molecular orbital software package was used to assign atomic charges to each of the molecules used in this study using the AM1 Hamiltonian (36). Four of the structures contained counterions or protonated amines or were drawn incorrectly, and they are indicated in Table 1. These molecules were resketched using the molecular modeling program HyperChem (HyperCube, Inc., Waterloo, ON) and given a rough three-dimensional conformation by first using the HyperChem model building feature followed by the MM+ molecular mechanics force field with the default options. A low-energy conformation was assigned to the resketched structures using the AM1 semiempirical molecular orbital methods within HyperChem. As before, the molecules were stored in individual .mol files and assigned atomic charges using MOPAC. The structures were then ready to be used with ADAPT¹ (37, 38). The entire set of compounds was divided into two subsets: a training set (TSET), whose information was used to build the actual models, and a prediction set (PSET), consisting of molecules not found in the TSET, which was used to validate the models once they were built. Members of each set were assigned randomly. The training set consisted of 120 compounds (85.7%), and the prediction set contained 20 compounds (14.3%). As an added precaution, it was verified that each set contained roughly the same percentage of genotoxic compounds (TSET = 27.5% genotoxic, PSET = 30% genotoxic).
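The random split and the class-balance check described above could be sketched as follows. This is not the ADAPT procedure; the function names, the seed, and the 0/1 label encoding are assumptions made for illustration.

```python
import random


def random_split(n_compounds, n_prediction, seed=0):
    """Randomly assign compound indices to a training and a prediction set."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    indices = list(range(n_compounds))
    rng.shuffle(indices)
    pset = sorted(indices[:n_prediction])
    tset = sorted(indices[n_prediction:])
    return tset, pset


def percent_genotoxic(labels, members):
    """Percentage of set members labeled genotoxic (1 = genotoxic, 0 = not)."""
    members = list(members)
    return 100.0 * sum(labels[i] for i in members) / len(members)


# 39 genotoxic and 101 nongenotoxic compounds, as in the data set.
labels = [1] * 39 + [0] * 101
tset, pset = random_split(len(labels), n_prediction=20)
print(len(tset), len(pset))
print(percent_genotoxic(labels, tset), percent_genotoxic(labels, pset))
```

The reported percentages correspond to 33 of 120 training compounds (27.5%) and 6 of 20 prediction compounds (30%); a split whose two percentages differ widely would simply be redrawn.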

Descriptor Generation. In the descriptor generation stage, various properties of the molecules were calculated and stored using routines found in the ADAPT package. Four classes of descriptors (39) were generated in each of the studies presented here: topological, geometric, electronic, and polar surface area. Topological descriptors are based on graph theory and encode information about the types of atoms and bonds in a molecule and the nature of their connections. Examples of topological descriptors include counts of atom and bond types and indexes that encode the size, shape, and types of branching in a molecule (40-48). Geometric descriptors encode information about the three-dimensional nature of the molecule. Examples of geometric descriptors are solvent accessible surface areas (49), moments of inertia, and shadow areas (50). Electronic descriptors encode the electronic character of the molecule. Examples include highest occupied molecular orbital (HOMO) and lowest unoccupied molecular orbital (LUMO) energies, dipole moments, and atomic charges. These are extracted from MOPAC output files. Finally, polar surface area descriptors are combinations of the solvent accessible surface area (a geometric property) and atomic charge (an electronic property). We encoded polar surface area information using the charged partial surface area (CPSA) descriptors (51). By taking combinations of atomic surface areas

¹ Abbreviations: QSTR, quantitative structure-toxicity relationship; ADAPT, automated data analysis and pattern recognition toolkit; GA, genetic algorithm; GSA, generalized simulated annealing; k-NN, k-nearest neighbor; LDA, linear discriminant analysis; PNN, probabilistic neural network.


Table 1. Data of 140 Thiophene Derivatives Used in This Study; Incorrect Predictions Are Shown in Parentheses

ID  setᵃ  name  CAS number  genetic toxicity, measuredᵇ (+S9, -S9)  genetic toxicity, calculatedᶜ (L, K, P, C)

1  T  acetazolamide  59-66-5  -  -  -  -  -  -
2  T  tenoxicam  59804-37-4  -  -  -  (+)  -  -
3  T  thiabendazole  148-79-8  +  -  +  +  +  +
4  T  2,2′-thenil  7333-07-5  -  -  -  -  -  -
5  T  2,3-dihydrothieno-(3,4-B)-1,4-dioxin  126213-50-1  +  -  +  (-)  +  +
6  T  2,5-thiophenedicarboxaldehyde  932-95-6  -  -  (+)  -  -  -
7  T  2-(4-aminophenyl)-6-methylbenzothiazole  92-36-4  -  -  -  -  (+)  -
8  T  2-((5-(dibutylamino)-2-thienyl)methylene)-1H-indene-1,3-(2H)-dione  212632-34-3  -  -  -  -  (+)  -
9  T  2-(dimethylaminomethyl)thiophene  26019-17-0  -  -  -  -  -  -
10  T  2-acetyl-5-bromothiophene  5370-25-2  +  -  +  (-)  +  +
11  T  2-amino-3,5-dinitrothiophene  2045-70-7  -  -  (+)  (+)  -  (+)
12  P  2-amino-4-chlorobenzothiazole  19952-47-7  -  -  (+)  -  -  -
13  T  2-amino-4-methoxybenzothiazole  5464-79-9  -  +  +  +  (-)  +
14  T  2-amino-4-methylbenzothiazole  1477-42-5  +  -  +  +  +  +
15  T  2-amino-5,6-dimethylbenzothiazole  29927-08-0  -  -  (+)  -  -  -
16  T  2-amino-6-fluorobenzothiazole  348-40-3  -  +  +  (-)  (-)  (-)
17  T  2-amino-6-chlorobenzothiazole  95-24-9  -  -  (+)  -  (+)  (+)
18  P  2-amino-6-methoxybenzothiazole  1747-60-0  +  -  +  +  +  +
19  T  2-amino-6-methylbenzothiazole  2536-91-6  -  -  (+)  -  -  -
20  T  2-aminobenzothiazole  136-95-8  -  -  (+)  -  -  -
21  T  2-bromo-5-chlorothiophene  2873-18-9  +  -  +  +  +  +
22  T  2-bromothiophene  1003-09-4  -  -  -  -  -  -
23  T  2-chloro-5-(chloromethyl)thiophene  23784-96-5  -  -  -  (+)  (+)  (+)
24  T  2-chlorothiophene  96-43-5  -  -  -  -  -  -
25  T  2-propylthiophene  1551-27-5  -  -  -  -  -  -
26  T  2-thiopheneacetic acid  1918-77-0  -  -  -  -  -  -
27  P  2-thiophenecarboxaldehyde  98-03-3  +  +  +  +  +  +
28  P  2-thiophenecarboxamide  5813-89-8  -  -  -  -  -  -
29  P  2-thiophenemethanol  636-72-6  -  -  -  -  -  -
30  P  2-amino-6-(methylsulfonyl)benzothiazole  17557-67-4  -  -  -  -  -  -
31  T  2-amino-6-ethoxybenzothiazole  94-45-1  +  +  (-)  +  +  +
32  T  3,6,9,14-tetrathiabicyclo-(9.2.1)-tetradeca-11,13-dieneᵈ  60147-18-4  -  -  -  -  (+)  -
33  T  3-acetyl-2,5-dichlorothiophene  36157-40-1  +  +  +  +  (-)  +
34  T  3-acetyl-2,5-dimethylthiophene  2530-10-1  +  -  +  +  +  +
35  T  3-bromo-2-chlorothiophene  40032-73-3  -  -  (+)  -  -  -
36  T  3-bromothianaphthene  7342-82-7  -  -  -  -  -  -
37  T  3-bromothiophene  872-31-1  -  -  -  -  -  -
38  T  3-methoxythiophene  17573-92-1  -  -  -  -  (+)  -
39  T  3-methyl-2-thiophenecarboxaldehyde  5834-16-2  +  -  +  +  +  +
40  T  3-thiopheneacetic acid  6964-21-2  -  -  -  -  -  -
41  T  3-thiopheneacetonitrile  13781-53-8  +  +  +  +  +  +
42  T  3-thiophenecarboxaldehyde  498-62-4  -  +  +  +  +  +
43  T  3-thiophenecarboxylic acid  88-13-1  -  -  -  -  -  -
44  P  3-thiophenemethanol  71637-34-8  -  -  -  -  -  -
45  T  4,6-diphenylthieno-(3,4-D)-(1,3)-dioxol-2-one-5,5-dioxide  54714-11-3  -  -  -  (+)  -  -
46  T  4-(2-thienyl)butyric acid  4653-11-6  -  -  -  -  -  -
47  P  4-bromo-2-thiophenecarboxaldehyde  18791-75-8  -  -  (+)  -  -  -
48  T  5-anilino-1,2,3,4-thiatriazole  13078-30-3  -  -  (+)  -  -  -
49  T  5-bromo-2-thiophenecarboxylic acid  7311-63-9  -  -  (+)  -  -  -
50  T  5-bromothiophene-2-carbaldehyde  4701-17-1  -  -  (+)  (+)  (+)  (+)
51  T  5-chloro-2-thiophenecarboxaldehyde  7283-96-7  +  +  +  (-)  +  +
52  P  5-ethyl-2-thiophenecarboxaldehyde  36880-33-8  -  +  +  (-)  +  +
53  P  5-methyl-2-thiophenecarboxylic acid  1918-79-2  -  -  -  -  -  -
54  T  5-nitro-2-thiophenecarboxaldehyde  4521-33-9  -  +  +  +  +  +
55  T  6-amino-2-mercaptobenzothiazole  7442-07-1  -  -  (+)  -  -  -
56  T  cyclopropyl 2-thienyl ketone  6193-47-1  -  -  (+)  -  (+)  (+)
57  T  coumarin 6  38215-36-0  -  -  -  -  -  -
58  T  di-2-thienyl ketone  704-38-1  -  +  +  (-)  +  +
59  T  dibenzothiophene-5,5-dioxide  1016-05-3  -  -  -  -  -  -
60  T  ethyl 2-thiopheneacetate  57382-97-5  -  -  -  (+)  (+)  (+)
61  T  ethyl 3-thiopheneacetate  37784-63-7  -  +  (-)  (-)  (-)  (-)
62  T  ethyl 2-amino-4,5,6,7-tetrahydrobenzo-(B)-thiophene-3-carboxylate  4506-71-2  -  -  (+)  -  -  -
63  T  N,N-dimethyl-N′-((5-nitro-2-thienyl)methylene)-1,4-phenylenediamine  126893-36-5  -  -  (+)  (+)  -  (+)
64  T  N,N-dimethyl-4-(6-methylbenzothiazol-2-yl)aniline  10205-62-6  -  -  -  -  -  -
65  T  suprofen  40828-46-4  -  -  -  -  -  -
66  T  thianaphthene  95-15-8  -  -  -  -  -  -
67  T  thieno-(3,2-B)-pyridin-7-ol  107818-20-2  -  -  (+)  (+)  -  (+)
68  T  trans-3-(3-thienyl)acrylic acid  102696-71-9  -  -  -  -  -  -
69  T  thiophene-2-carbonitrile  1003-31-2  +  +  +  +  +  +
70  T  trans-2-(4-(dimethylamino)styryl)benzothiazole  144528-14-3  -  -  -  -  -  -
71  T  dibenzothiophene  132-65-0  -  -  -  -  -  -
72  T  2-benzoylthiophene  135-00-2  -  -  -  -  -  -
73  T  2,2′-bithiophene  492-97-7  -  -  -  -  -  -
74  T  2-phenylthiophene  825-55-8  -  -  -  -  -  -
75  T  2-thiopheneglyoxylic acid  4075-59-6  -  -  -  -  -  -
76  P  5,5′-dibromo-2,2′-bithiophene  4805-22-5  -  -  (+)  -  -  -
77  T  diethyl 5-amino-3-methyl-2,4-thiophenedicarboxylate  4815-30-9  -  -  -  -  -  -
78  T  methyl 3-amino-2-thiophenecarboxylate  22288-78-4  -  -  -  -  -  -
79  T  3-thiophenemalonic acid  21080-92-2  -  -  -  -  -  -
80  T  3-phenylthiophene  2404-87-7  -  -  -  -  -  -
81  T  2,3-thiophenedicarboxaldehyde  932-41-2  +  +  +  +  +  +
82  T  2,5-dibromo-3-hexylthiophene  116971-11-0  -  -  -  (+)  -  -
83  P  3-(thianaphthen-3-yl)-L-alanine  308103-39-1  +  -  (-)  +  (-)  (-)
84  T  3,3′-bithiophene  3172-56-3  -  -  -  -  -  -
85  T  4,6-dimethyldibenzothiophene  1207-12-1  -  -  -  -  -  -
86  T  2,2′:5′,2′′-terthiophene-5,5′′-dicarboxaldehyde  13130-50-2  -  -  -  -  -  -
87  T  4-keto-4,5,6,7-tetrahydrothianaphthene  13414-95-4  +  +  +  +  (-)  +
88  T  5-methyl-2-thiophenecarboxaldehyde  13679-70-4  -  -  (+)  -  (+)  (+)
89  P  nocodazole  31430-18-9  -  -  -  -  (+)  -
90  T  5-nitrothiophene-2-carboxylic acid  6317-37-9  -  -  -  -  -  -
91  T  α-terthienyl  1081-34-1  -  -  -  -  -  -
92  T  D-α-(2-thienyl)glycineᵈ  43189-45-3  -  -  -  -  -  -
93  T  methapyrilene  135-23-9  +  -  (-)  +  +  +
94  T  ticlopidine  55142-85-3  -  -  -  -  -  -
95  T  cephalothin sodiumᵉ  58-71-9  -  -  -  -  -  -
96  T  2-acetylthiophene  88-15-3  -  +  +  (-)  (-)  (-)
97  T  2,5-bis(5-tert-butyl-2-benzoxazolyl)thiophene  7128-64-5  -  -  -  -  (+)  -
98  P  2-iodo-5-methylthiophene  16494-36-3  -  -  -  -  -  -
99  P  2-methylthianaphthene  1195-14-8  -  -  -  -  -  -
100  T  2-nitrothiophene  609-40-5  +  +  +  +  +  +
101  T  3-(2-thienyl)-L-alanine  22951-96-8  -  -  -  -  -  -
102  T  3-iodothiophene  10486-61-0  -  -  -  -  -  -
103  T  2-(trifluoroacetyl)thiophene  651-70-7  -  -  (+)  -  (+)  (+)
104  T  1-(2-thienyl)-1-propanone  13679-75-9  -  -  (+)  -  (+)  (+)
105  T  2-iodothiophene  3437-95-4  -  -  -  -  -  -
106  T  2-thiophenecarboxylic acid  527-72-0  -  -  -  -  -  -
107  P  2-thiopheneethylamine  30433-91-1  -  -  -  -  (+)  -
108  T  4-methyldibenzothiophene  7372-88-5  -  -  -  -  -  -
109  T  2-thiophenecarboxylic hydrazide  2361-27-5  -  -  -  -  -  -
110  T  2,3,5-tribromothiophene  3141-24-0  -  +  +  +  +  +
111  T  2-acetyl-3-methylthiophene  13679-72-6  -  +  +  +  +  +
112  T  2-acetyl-5-chlorothiophene  6310-09-4  +  +  +  (-)  +  +
113  T  3-methyl-2-thiophenecarboxylic acid  23806-24-8  -  +  +  (-)  +  +
114  T  2,5-diiodothiophene  625-88-7  +  +  +  +  +  +
115  T  2-bromo-3-methylthiophene  14282-76-9  -  -  -  -  -  -
116  T  3-bromo-4-methylthiophene  30318-99-1  -  -  -  -  -  -
117  P  trans-2-(2-nitrovinyl)thiophene  34312-77-1  -  -  -  (+)  -  -
118  P  3-acetylthiophene  1468-83-3  +  +  +  +  +  +
119  T  2-thiopheneacetonitrile  20893-30-5  +  -  +  +  +  +
120  T  3-butylthiophene  34722-01-5  -  -  -  -  -  -
121  T  3-dodecylthiophene  104934-52-3  -  -  -  -  -  -
122  P  ethyl 2-thiophenecarboxylate  2810-04-0  -  +  +  (-)  +  +
123  P  2,5-dibromo-3-decylthiophene  158956-23-1  -  -  -  -  -  -
124  T  2-(4-methoxybenzoyl)thiophene  4160-63-8  -  -  -  -  -  -
125  T  3,4-dibromothiophene  3141-26-2  -  -  (+)  -  -  -
126  T  2-(3-thienyl)ethanol  13781-67-4  -  -  -  -  -  -
127  T  2-(2-thienyl)ethanol  5402-55-1  -  -  -  -  -  -
128  T  2,3-dibromothiophene  3140-93-0  +  -  +  (-)  +  +
129  T  2,5-dibromo-3-dodecylthiophene  148256-63-7  -  -  -  -  -  -
130  T  2,5-dichlorothiophene  3172-52-9  +  +  +  +  +  +
131  T  2-bromo-5-methylthiophene  765-58-2  -  -  -  -  -  -
132  T  2,5-dibromothiophene  3141-27-3  +  +  +  +  +  +
133  T  tetrachlorothiophene  6012-97-1  -  -  (+)  (+)  (+)  (+)
134  T  3-octylthiophene  65016-62-8  -  -  -  -  -  -
135  T  L-α-(3-thienyl)glycine  1194-87-2  -  -  -  -  -  -
136  T  L-α-(2-thienyl)glycine  65058-23-3  -  -  -  -  -  -
137  T  cefoxitin sodium saltᵈ,ᵉ  33564-30-6  -  -  -  -  (+)  -
138  T  β-(2-thienyl)-D-alanine  139-86-6  -  -  -  -  -  -
139  T  3-chloroacetylbenzo-(B)-thiophene  26167-44-2  -  -  -  -  -  -
140  T  dibenzothiophene-2,8-diylbis((N-carbonylmethylene)dimethylamine)ᵈ  35556-06-0  +  +  (-)  +  (-)  (-)

ᵃ T, training set member; CV, cross-validation set member; P, prediction set member. ᵇ +S9, with S9; -S9, without S9. ᶜ L, LDA; K, k-NN; P, PNN; C, consensus. ᵈ Structure was redrawn prior to descriptor calculation. ᵉ Counterions have been replaced with hydrogen atoms.


and charges for specific atom types, CPSA descriptors can convey information about the specific atom types (such as sulfur) or encode the hydrogen-bonding ability of a molecule. The CPSA descriptors are related to the molecular polar surface area descriptors of Stenberg et al. (52). A total of 250 descriptors were calculated for each compound; 136 were topological in nature, 33 were geometric, 33 were electronic, and 48 were polar surface area descriptors.

Objective Feature Selection. Objective feature selection was used to reduce the pool of 250 descriptors by eliminating those descriptors (features) that contained little or redundant information. Any descriptor whose values were identical for at least 90% of the compounds in the training set was removed due to insufficient information content. In addition, one of two descriptors whose pairwise correlation coefficient exceeded 0.90 for the training set members was also removed to eliminate redundant information. Objective feature selection is carried out using only the independent variables (descriptors); the dependent variable (class membership) is not used. The resulting reduced pool consisted of 97 descriptors. Each of the descriptors was subsequently linearly transformed from absolute to relative values on the range [0,1] such that 0 and 1 represent the lowest and highest descriptor values for the compounds of the TSET. The reduced pool was then submitted to the subjective feature selection routines for model development.
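The two filters and the rescaling step can be sketched in a few lines. This is a minimal illustration, not the ADAPT routine: the function names are hypothetical, the correlation is taken in absolute value, and of two correlated descriptors the one encountered first is kept. The text does not state which member of a correlated pair was discarded, so both choices are assumptions.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length value lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def objective_feature_selection(columns, max_identical=0.90, max_corr=0.90):
    """Return descriptor names that survive both filters.

    columns maps a descriptor name to its training-set values.
    """
    kept = []
    for name, values in columns.items():
        # Filter 1: drop descriptors identical for >= 90% of compounds.
        most_common = max(values.count(v) for v in set(values))
        if most_common / len(values) >= max_identical:
            continue
        # Filter 2: drop descriptors correlated above 0.90 with a kept one.
        if any(abs(pearson(values, columns[k])) > max_corr for k in kept):
            continue
        kept.append(name)
    return kept


def scale_to_unit_range(values):
    """Linearly map training-set values onto [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


toy = {
    "a": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "b": [2, 4, 6, 8, 10, 12, 14, 16, 18, 20],  # perfectly correlated with a
    "c": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],        # 90% identical values
    "d": [5, 1, 4, 2, 8, 3, 9, 7, 6, 10],
}
print(objective_feature_selection(toy))
```

Note that neither filter looks at the class labels, which is what makes this stage "objective" in the sense used in the text.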

Subjective Feature Selection. Subjective feature selection was used to select subsets of descriptors from the reduced pool that optimally modeled the genetic toxicity of the thiophenes. This was achieved through the use of a subjective feature selection algorithm, which selects subsets of descriptors coupled with a fitness evaluator that determines the value of the cost function (a measure of the overall performance of the model) associated with each of the descriptor subsets. For the models presented here, the subjective feature selection routines used were GSA and GA. The fitness evaluators used were linear discriminant analysis (LDA), k-nearest neighbor (k-NN), and probabilistic neural network (PNN). A GA was used to select descriptor subsets for the k-NN and LDA fitness evaluators; a GSA feature selection algorithm was used to select descriptor subsets for the PNN. For LDA and PNN, optimization involved modifying the adjustable parameters so that the predefined cost function was minimized. Internal validation was used in these fitness evaluators to prevent overtraining, a phenomenon in which the model learns the idiosyncrasies of the TSET members and loses the ability to generalize. Depending on the particular model-building routine, this was achieved either by the leave-one-out (LOO) method (for k-NN) or by the leave-n-out method (for LDA). The subjective feature selection routines in each case provided a list of the top-performing models, and the predictive power of these was assessed using the compounds of the PSET. Models whose PSET errors were comparable to or lower than their TSET error were considered to have good predictive ability.

GA. A GA (53, 54) is an evolutionary optimization routine that is used to solve large, otherwise computationally intractable problems, such as finding the optimal set of descriptors from a reduced pool that describe the genotoxicity of a set of thiophene derivatives. The GA that was used in this study began by defining a population of models, where each model was a “chromosome” containing a predefined number of descriptors representing the individual “genes”. Each model in the population was evaluated using a fitness evaluator (in this case LDA or k-NN), and the top-performing models were chosen for the subsequent mating operation. In this study, single cross-over mating was employed to create two children models from two randomly selected parent models. Random mutations were also implemented, wherein a descriptor in about 5% of the models was randomly replaced with a different descriptor. The process of population mating and mutation, evaluation, and selection was repeated for 1000 iterations.
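A compact sketch of such a GA is given below. It follows the steps described above (fixed-size chromosomes, selection of the top models, single cross-over mating, roughly 5% mutation), but the population size, the rule of keeping the better half as parents, and the repair of duplicate descriptors after cross-over are assumptions that the text does not specify. The cost function is supplied by the caller, standing in for the LDA or k-NN fitness evaluator.

```python
import random


def repair(child, n_descriptors, rng):
    """Replace duplicated descriptors so a chromosome has unique genes."""
    unique = list(dict.fromkeys(child))
    unused = [d for d in range(n_descriptors) if d not in unique]
    rng.shuffle(unused)
    return unique + unused[: len(child) - len(unique)]


def genetic_feature_selection(n_descriptors, subset_size, cost, pop_size=20,
                              n_iterations=1000, mutation_rate=0.05, seed=0):
    """Search for a low-cost descriptor subset (requires 2 <= subset_size < n_descriptors)."""
    rng = random.Random(seed)
    population = [rng.sample(range(n_descriptors), subset_size)
                  for _ in range(pop_size)]
    for _ in range(n_iterations):
        population.sort(key=cost)                # evaluate and rank models
        parents = population[: pop_size // 2]    # keep top performers
        children = []
        while len(parents) + len(children) < pop_size:
            mother, father = rng.sample(parents, 2)
            cut = rng.randrange(1, subset_size)  # single cross-over point
            for child in (mother[:cut] + father[cut:],
                          father[:cut] + mother[cut:]):
                child = repair(child, n_descriptors, rng)
                if rng.random() < mutation_rate:  # mutate about 5% of models
                    unused = [d for d in range(n_descriptors) if d not in child]
                    child[rng.randrange(subset_size)] = rng.choice(unused)
                children.append(child)
        population = parents + children[: pop_size - len(parents)]
    return min(population, key=cost)


# Toy cost: number of selected descriptors outside a hypothetical target set.
target = {1, 3, 5}
best = genetic_feature_selection(
    10, 3, cost=lambda subset: len(set(subset) - target), n_iterations=50)
print(sorted(best))
```

Because the parents are carried over unchanged, the best cost in the population can never get worse from one iteration to the next.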

GSA. GSA (53, 55) is an optimization method that mimics the naturally occurring annealing process, in which the temperature of a system starts relatively high, then is slowly lowered, leading to an optimal or highly ordered state. The GSA algorithm used in this study began by selecting a random set of descriptors. The initial model was evaluated using the PNN fitness evaluator. One of the descriptors was then randomly replaced with another and was again evaluated by the PNN. If the new model’s cost function was less than the current best cost function, then the model was automatically kept. If the new model’s cost function was greater than the current best cost function, then this detrimental step was kept if a randomly generated number between 0 and 1 was less than the probability $P = \exp(-\beta \, \Delta C(t_s))$. Here $\beta$ is the control parameter and $\Delta C(t_s)$ is the difference in cost functions between the current and the best models. The $\beta$ value was initially chosen so that 80% of the models that represented detrimental steps were accepted, representing the higher initial temperature. Every 1000 iterations, $\beta$ was increased by a factor of 2 to simulate the cooling process. The GSA algorithm ended when 900 iterations passed without an improvement in the error function or the algorithm had run for 50 000 iterations.
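The acceptance rule and cooling schedule can be sketched as below. This is an interpretation rather than the authors' code: the sketch compares each candidate with the currently accepted model and records the best model seen separately, the initial value of the control parameter is passed in directly instead of being tuned to an 80% acceptance rate, and the cost function stands in for the PNN fitness evaluator. All names are hypothetical.

```python
import math
import random


def accept_step(new_cost, current_cost, beta, rng):
    """Always keep improvements; keep detrimental steps with P = exp(-beta * dC)."""
    if new_cost < current_cost:
        return True
    return rng.random() < math.exp(-beta * (new_cost - current_cost))


def gsa_feature_selection(n_descriptors, subset_size, cost, beta=1.0,
                          max_iterations=50_000, patience=900,
                          cooling_interval=1000, seed=0):
    """Annealing-style search over descriptor subsets (subset_size < n_descriptors)."""
    rng = random.Random(seed)
    current = rng.sample(range(n_descriptors), subset_size)
    current_cost = cost(current)
    best, best_cost = list(current), current_cost
    stalled = 0
    for iteration in range(1, max_iterations + 1):
        # Randomly replace one descriptor with an unused one.
        candidate = list(current)
        unused = [d for d in range(n_descriptors) if d not in candidate]
        candidate[rng.randrange(subset_size)] = rng.choice(unused)
        candidate_cost = cost(candidate)
        if accept_step(candidate_cost, current_cost, beta, rng):
            current, current_cost = candidate, candidate_cost
        if current_cost < best_cost:
            best, best_cost = list(current), current_cost
            stalled = 0
        else:
            stalled += 1
        if stalled >= patience:          # 900 iterations without improvement
            break
        if iteration % cooling_interval == 0:
            beta *= 2                    # "cooling": detrimental steps get rarer
    return best, best_cost


target = {1, 3, 5}
subset, value = gsa_feature_selection(
    10, 3, cost=lambda s: len(set(s) - target))
print(sorted(subset), value)
```

Doubling the control parameter halves the cost increase that is tolerated at a given acceptance probability, so late in the run the search behaves almost like a pure descent.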

LDA Model Development. The LDA (56, 57) model classifies compounds by dividing an n-dimensional descriptor space into two regions that are separated by a hyperplane that is defined by a linear discriminant function, eq 1. Here, the x values are the descriptor values and the b values are weights associated with the respective descriptor values. The two regions formed by the hyperplane correspond to the two classes to which individual compounds are predicted to belong. The compounds are assigned a value L by the discriminant function and assigned one of the two classes based on whether L is greater than or less than a cutoff score. This cutoff score and the b values in eq 1 are determined using the compounds of the TSET. The cost function used when selecting LDA models was that of Bakken and Jurs (58). This cost function was designed to simultaneously ensure good model generalizability and eliminate class bias within the model. It will be briefly described here. In the leave-n-out method, n randomly selected TSET members (10% or 12 compounds in this case) were used as a cross-validation set (CVSET) and the remaining 90% were kept in the TSET. This process was repeated 10 times such that each TSET member appeared in the CVSET exactly once. For each TSET-CVSET pair, the cost function used was COST = TSETw + 0.5 × |TSETw - CVSETw|. Here, TSETw is the percent incorrect classification for the most poorly predicted class in the TSET. CVSETw is defined in the same way for the CVSET. The overall cost function used was the mean of the 10 individual cost functions for each descriptor subset. The compounds of the PSET were then used to externally validate the top-performing LDA models.
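The leave-n-out cost can be written out directly from its definition. The sketch below assumes that class labels and predictions are already available for each TSET-CVSET pair; the function names are illustrative only.

```python
def worst_class_error(y_true, y_pred):
    """Percent misclassified in the most poorly predicted class."""
    worst = 0.0
    for cls in set(y_true):
        members = [i for i, y in enumerate(y_true) if y == cls]
        wrong = sum(1 for i in members if y_pred[i] != cls)
        worst = max(worst, 100.0 * wrong / len(members))
    return worst


def fold_cost(tset_true, tset_pred, cvset_true, cvset_pred):
    """COST = TSETw + 0.5 * |TSETw - CVSETw| for one TSET-CVSET pair."""
    tset_w = worst_class_error(tset_true, tset_pred)
    cvset_w = worst_class_error(cvset_true, cvset_pred)
    return tset_w + 0.5 * abs(tset_w - cvset_w)


def overall_cost(folds):
    """Mean of the per-fold costs (10 folds in the study)."""
    costs = [fold_cost(*fold) for fold in folds]
    return sum(costs) / len(costs)
```

The second term penalizes any gap between training and cross-validation performance, which is how the cost discourages overfitted models, while the use of the worst class in each term discourages class bias.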

k-NN Model Development. The k-NN algorithm (59) is a simple way to assign classes to compounds based on the class of the compounds’ k-nearest neighbors, where k is an odd positive integer. The nearest neighbors are determined using the standard Euclidean distance function in a d-dimensional Euclidean descriptor space, where d is the number of descriptors in the model. The value of k used in this study is 3. The cost function used was the number of misclassified compounds in the TSET. A LOO internal validation method was used when building the k-NN models. In the LOO method, one TSET member was predicted at a time while using the remaining (n - 1) TSET members to generate the prediction. After training, the k-NN models were externally validated using the members of the PSET. An advantage of the k-NN method is its simplicity: there are no parameters to optimize, and as such, the algorithm is extremely fast. However, experience has shown that k-NN tends to be biased in its predictions toward the more populous class.
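A compact sketch of the 3-NN classifier and its LOO cost is given below. It assumes descriptor vectors are stored as tuples in plain Python lists; descriptor scaling, which the text does not specify here, is left out.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Assign the majority class of the k nearest training compounds."""
    order = sorted(range(len(train_x)),
                   key=lambda i: math.dist(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def loo_misclassified(x, y, k=3):
    """Cost function: number of compounds misclassified under leave-one-out."""
    errors = 0
    for i in range(len(x)):
        rest_x = x[:i] + x[i + 1:]      # the remaining (n - 1) members
        rest_y = y[:i] + y[i + 1:]
        if knn_predict(rest_x, rest_y, x[i], k) != y[i]:
            errors += 1
    return errors
```

Because k is odd and there are two classes, the vote can never tie.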

PNN Model Development. The PNN (60-62), also known as kernel discriminant analysis, is based on statistical probability theory and is designed to address classification problems.

$$L = b_1 x_1 + b_2 x_2 + \cdots + b_k x_k \qquad (1)$$

Genotoxicity of Thiophene Derivatives Chem. Res. Toxicol., Vol. 16, No. 6, 2003 725

The PNN is a Bayes classifier implemented as a neural network. In a PNN, a joint probability density function (pdf) relating the input descriptors to each of the output classes is approximated using the members of the TSET by a method attributed to Parzen (63). In Parzen’s method, the class of an unknown compound is determined by the number and proximity of TSET compounds whose class is known. Each of the TSET compounds is represented by a kernel function. The kernel function used in this study is the Gaussian function exp(-x²/σ), where σ (sometimes called a “smoothing parameter”) determines the width of the kernel function and how much influence the nearby TSET members have relative to the more distant TSET members. For the models presented here, a separate σ variable was used for each input descriptor. The approximated pdf’s for each class are then used to determine the likelihood that an unknown compound belongs to each of the possible classes. These likelihood values are known as output activations. In the PNN, training involves finding the optimal values for the σ variables associated with each of the input descriptors. The Polak-Ribiere conjugate gradient algorithm (64) combined with Brent’s line minimization algorithm (65) was used to optimize these parameters by minimizing a continuously defined cost function (eq 2) based on the individual class output activations.

In eq 2, TSET compound x_I is known to belong to class k. The b values are the output activations for each class. The first term in the numerator is the probability that x_I does not belong to the correct class k. The remaining numerator terms are the probabilities that x_I belongs to an incorrect class m. N is the number of compounds in the TSET. The PNN models were then externally validated using the members of the PSET.
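The Parzen estimate and the cost of eq 2 can be sketched as follows. Several details are assumptions made for illustration: the per-descriptor kernels are multiplied together (so the exponents add), each class density is averaged over its members to mimic equal priors, and the activations are normalized to sum to 1. The σ optimization by conjugate gradient is not shown.

```python
import math


def pnn_activations(train_x, train_y, query, sigmas):
    """Normalized class activations from Gaussian kernels exp(-x^2 / sigma)."""
    classes = sorted(set(train_y))
    density = {}
    for cls in classes:
        members = [x for x, y in zip(train_x, train_y) if y == cls]
        total = 0.0
        for x in members:
            exponent = sum((q - xi) ** 2 / s
                           for q, xi, s in zip(query, x, sigmas))
            total += math.exp(-exponent)
        density[cls] = total / len(members)   # class-size average: equal priors
    norm = sum(density.values())
    if norm == 0.0:                           # query far from every TSET member
        return {cls: 1.0 / len(classes) for cls in classes}
    return {cls: density[cls] / norm for cls in classes}


def eq2_cost(true_classes, activations):
    """Eq 2: mean over compounds of [1 - b_k]^2 plus the sum of b_m^2 for m != k."""
    total = 0.0
    for k, b in zip(true_classes, activations):
        total += (1.0 - b[k]) ** 2
        total += sum(v ** 2 for m, v in b.items() if m != k)
    return total / len(true_classes)
```

The cost is 0 when every compound receives an activation of 1 for its true class, and it varies smoothly with σ, which is what makes gradient-based optimization possible.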

$$\mathrm{COST} = \frac{\sum_{I=1}^{N} \left\{ [1 - b_k(x_I)]^2 + \sum_{m \neq k} [b_m(x_I)]^2 \right\}}{N} \qquad (2)$$

Results and Discussion

Correlation with Epoxidizability and Sulfoxidizability. It was hypothesized that the mechanism by which the thiophene derivatives exert their genotoxic effects comes about as a result of epoxidation of the thiophene ring and/or by sulfoxidation of the thiophene sulfur atom (see Figure 1). A substructure search was performed that returned a count of the number of epoxidizable sites on the molecule. Sixteen compounds were determined not to be epoxidizable, and of these, four were genotoxic (25%). Of the remaining 124 epoxidizable compounds, 35 were genotoxic (28%). Because the relative occurrence of epoxidizable compounds in each class was roughly equal, the mechanism of genotoxicity was determined not to be solely related to epoxidation. Also, because most of the compounds in the study contain one thiophene sulfur atom, substructure searches were not performed using that substructure element as a means to discriminate between genotoxic and nongenotoxic derivatives. Instead, CPSA descriptors included in the reduced pool that encode combinations of sulfur atom charge and surface area were relied upon to encode the sulfoxidizability of the thiophene ring.

LDA Classification Models. Using a GA feature selection routine, LDA models containing from two to 12 descriptors were examined for overall quality. The subset of descriptors deemed to have the best performance is presented in Table 2. The confusion matrices for the resulting seven-descriptor LDA model are presented in Table 3. This model was able to predict 67 of 87 nongenotoxic compounds correctly (77%) and 29 of 33 genotoxic compounds correctly (88%) for the TSET. The overall prediction rate for the TSET was 80% correct. The model predicted 11 of 14 nongenotoxic compounds correctly (79%) and five of six genotoxic compounds correctly (83%) for the PSET. The overall prediction rate for the PSET was 80% correct. In this model, the genotoxic compounds are predicted correctly more often than the nongenotoxic compounds. This was not considered to be detrimental, as one of the goals of this study was to minimize the number of false negatives where possible, while retaining an acceptable classification rate for the nongenotoxic compounds. Pairwise correlation coefficients (r values) among the seven descriptors ranged from -0.572 (ELEC-0 and FNSA-3) to 0.695 (FNSA-3 and SCAA-2). The mean of the absolute values of the correlation coefficients was 0.260.

This model contains all four descriptor subclasses: topological, geometric, electronic, and polar surface area descriptors are all included. V6P-7 is the 6th order valence-weighted path connectivity (χ) index (66). This descriptor encodes paths containing exactly six bonds connecting heavy atoms. This index is valence-weighted, and the presence of heteroatoms will influence its value. In general, molecular structures containing more paths of length six, fewer second principal quantum level heteroatoms, and more nonsecond principal quantum level heteroatoms will have higher V6P-7 values. The molecules in the thiophene data set with the smallest V6P-7 values contain fewer than seven bonds to heavy atoms; those with the largest V6P-7 values are structures containing multiple sulfur-containing rings. In examining the ranges of the genotoxic and nongenotoxic compounds in Table 2, it was noted that no compound with a V6P-7 value greater than 1.375 (14 compounds) was genotoxic. WTPT-5 is the sum of all molecular ID path weights for all paths starting from nitrogen atoms (47). The molecular ID path weights are sums of individual bond weights, which in turn are dependent upon the connectivity of the atoms that define the bond. This descriptor is encoding the nitrogen atom content of the molecule as well as the molecular branching environment about the nitrogen atoms. The molecules with the lowest WTPT-5 values contain no nitrogen; those with the highest WTPT-5 values contain several nitrogen atoms in rings. L/B-1 is the maximal length to breadth ratio. The molecules are considered to be rigid, and the first two principal moments are aligned with the X- and Y-axes. The molecule is then rotated about the Z-axis in 10° increments. L/B-1 is the maximum ratio of the molecular extents along the X- and Y-axes. The molecules giving rise to the smallest L/B-1 ratios are thiophenes substituted with small side chains. The largest L/B-1 values are found for thiophenes with large and/or opposing side chains. No compound with an L/B-1 value greater than 1.742 (25 compounds) was genotoxic. ELEC-0 is the molecular electronegativity and is calculated as the mean of the HOMO and LUMO energies. No distinct linear trend was observable with this descriptor, although removing it from the model reduced the TSET and PSET classification rates to 66 and 55%, respectively. FNSA-3, SCAA-2, and SULF-5 are all CPSA descriptors. In general, the CPSA descriptors convey the ability of a molecule to interact with its surroundings via electronic interactions. Such interactions are important components of hydrogen bonding and lipophilicity (for example log P) as well as more specific binding phenomena such as enzyme-ligand interactions. FNSA-3 is the fractional charge-weighted partial negative surface area of the molecule. Structures with FNSA-3 less than -0.1650 (six compounds) were nongenotoxic. SCAA-2 is a hydrogen-bonding descriptor and is the mean (surface area × charge) per H-bond acceptor atom. Structures with SCAA-2 less than -4.027 (11 compounds) were nongenotoxic. SULF-5 is the charge-weighted fractional surface area of sulfur atoms in the molecule. The inclusion of a sulfur-related CPSA descriptor suggests that sulfoxidizability may be a factor in the toxicity of the thiophenes. Structures with SULF-5 greater than 0.0869 (five compounds) were nongenotoxic.

Table 2. Descriptors Included in the Selected GA-LDA Model

| descriptor (b) | type (a) | coeff | range: nontoxic | range: toxic |
|---|---|---|---|---|
| V6P-7 | T | -0.0033 | 0.000 to 2.901 | 0.000 to 1.375 |
| WTPT-5 | T | 0.0025 | 0.000 to 11.70 | 0.000 to 9.416 |
| L/B-1 | G | -0.0109 | 1.056 to 2.885 | 1.089 to 1.742 |
| ELEC-0 | E | 0.0180 | 4.286 to 6.501 | 4.114 to 6.019 |
| FNSA-3 | PSA | 0.0467 | -3.086 to -1.119 | -1.650 to -1.734 |
| SCAA-2 | PSA | -0.0254 | -25.83 to 21.81 | -4.027 to 21.30 |
| SULF-5 | PSA | 0.0201 | 0.000 to 0.1786 | 0.0067 to 0.0869 |

(a) T, topological; G, geometric; E, electronic; PSA, polar surface area. (b) V6P-7, 6th order valence-weighted path χ index; WTPT-5, sum of molecular IDs path weights starting from nitrogen atoms; L/B-1, maximal length to breadth ratio; ELEC-0, electronegativity; FNSA-3, fractional charge-weighted partial negative surface area; SCAA-2, mean (surface area × charge) per H-bond acceptor atom; SULF-5, charge-weighted fractional surface area of sulfur atoms.

Table 3. Confusion Matrices for the Selected GA-LDA Model

| actual class | predicted nontoxic | predicted toxic | % correct |
|---|---|---|---|
| training set | | | 80.00 |
| nontoxic | 67 | 20 | 77.01 |
| toxic | 4 | 29 | 87.88 |
| prediction set | | | 80.00 |
| nontoxic | 11 | 3 | 78.57 |
| toxic | 1 | 5 | 83.33 |
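The percentages reported in Table 3 follow directly from the confusion-matrix counts. The short sketch below, with a hypothetical helper name, recomputes the per-class and overall rates for the LDA training set.

```python
def percent_correct(matrix):
    """matrix[actual][predicted] holds counts; returns (per-class %, overall %)."""
    per_class = {actual: 100.0 * row[actual] / sum(row.values())
                 for actual, row in matrix.items()}
    total = sum(sum(row.values()) for row in matrix.values())
    correct = sum(row[actual] for actual, row in matrix.items())
    return per_class, 100.0 * correct / total


# Training-set counts transcribed from Table 3.
table3_tset = {
    "nontoxic": {"nontoxic": 67, "toxic": 20},
    "toxic": {"nontoxic": 4, "toxic": 29},
}
per_class, overall = percent_correct(table3_tset)
```

The off-diagonal entry in the "toxic" row (4) is the number of false negatives, the error type the study sought to minimize.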

k-NN Classification Models. A GA feature selection routine was used to examine descriptor subsets containing from 2 to 12 members using a k-NN fitness evaluator. The k-NN fitness evaluator uses a LOO internal validation method. The best-performing k-NN model contained only three descriptors, in contrast to the LDA model’s seven descriptors. These descriptors are presented in Table 4. In addition to having significantly fewer descriptors, the selected k-NN model also outperformed the LDA model. The confusion matrices for the k-NN model are presented in Table 5. This model was able to predict 77 of 87 nongenotoxic compounds correctly (89%) and 23 of 33 genotoxic compounds correctly (70%) for the TSET. The overall prediction rate for the TSET was 83% correct. The model predicted 13 of 14 nongenotoxic compounds correctly (93%) and four of six genotoxic compounds correctly (67%) for the PSET. The overall prediction rate for the PSET was 85% correct. A significant feature of this model is that the genotoxic compounds are less well-predicted than the nongenotoxic compounds. The tendency to classify an unknown compound as nongenotoxic will result in a relatively large number of false negatives. False negatives may be considered to be less desirable than false positives because a great deal of time and money may have been spent developing a potential therapeutic agent before the toxicity is discovered. Alternatively, the tendency of a model to predict false positives may result in the prevention of a blockbuster drug from ever being tested. In selecting the models presented in this work, a slight bias was taken in favor of those with fewer false negatives in the PSET when the total number of misclassified PSET compounds was equal. Pairwise correlation coefficients (r values) among the three descriptors in the k-NN model were 0.129 (L/B-1 and HOMO-0), -0.047 (HOMO-0 and CHAA-2), and -0.090 (L/B-1 and CHAA-2).

The selected k-NN model contains one geometric and two electronic descriptors. The geometric descriptor is L/B-1, the molecular length to breadth ratio. This descriptor was also included in the LDA model. HOMO-0 is the energy of the HOMO, as determined using MOPAC with the AM1 Hamiltonian. This descriptor may in fact be encoding the ability of the molecule to form DNA adducts, since the HOMO energies would likely be quite relevant in such reactions. The final descriptor, CHAA-2, is a hydrogen-bonding descriptor and is the mean atomic charge of the H-bond acceptor atoms in the molecule.

To construct a model less biased toward predicting false negatives, a weighting scheme was incorporated into the cost function of the GA-k-NN feature selection algorithm. Because there are 2.64 times as many nongenotoxic compounds in the TSET as there are genotoxic compounds, weights were applied such that one false negative contributed 2.64 times as much as that contributed by a false positive to the total cost function. The resulting list of top models for the weighted GA-k-NN feature selection algorithm, however, showed no improvement over the original GA-k-NN implementation, and a superior model was not found.
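The weighting scheme amounts to a one-line change in the misclassification count. In this sketch the function name and the label encoding (1 for genotoxic) are assumptions; the default weight is the TSET class ratio 87/33, which rounds to 2.64.

```python
def weighted_cost(y_true, y_pred, fn_weight=87 / 33, positive=1):
    """Misclassification cost in which a false negative counts fn_weight times."""
    false_neg = sum(1 for t, p in zip(y_true, y_pred)
                    if t == positive and p != positive)
    false_pos = sum(1 for t, p in zip(y_true, y_pred)
                    if t != positive and p == positive)
    return fn_weight * false_neg + false_pos
```

With a weight of 1 the function reduces to the unweighted count of misclassified compounds used for the original k-NN models.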

Table 4. Descriptors Included in the Selected GA-k-NN Model

| descriptor (b) | type (a) | range: nontoxic | range: toxic |
|---|---|---|---|
| L/B-1 | G | 1.056 to 2.885 | 1.089 to 1.742 |
| HOMO-0 | E | -10.81 to -8.128 | -10.14 to -8.508 |
| CHAA-2 | E | -0.206 to 0.450 | -0.123 to 0.451 |

(a) G, geometric; E, electronic. (b) L/B-1, maximal length to breadth ratio; HOMO-0, AM1 energy of highest occupied molecular orbital; CHAA-2, mean atomic charge of H-bond acceptor atoms.

Table 5. Confusion Matrices for the Selected GA-k-NN Model

| actual class | predicted nontoxic | predicted toxic | % correct |
|---|---|---|---|
| training set | | | 83.33 |
| nontoxic | 77 | 10 | 88.51 |
| toxic | 10 | 23 | 69.70 |
| prediction set | | | 85.00 |
| nontoxic | 13 | 1 | 92.86 |
| toxic | 2 | 4 | 66.67 |

PNN Classification Models. A simulated annealing feature selection routine was used to examine descriptor subsets containing from two to seven members using a PNN fitness evaluator. The priors for each class were assigned so that neither the genotoxic nor the nongenotoxic class was favored; that is, the model was unbiased with respect to the populations of the individual classes. The best-performing PNN model contained three descriptors and is presented in Table 6. The confusion matrices for the PNN model are presented in Table 7. This model was able to predict 72 of 87 nongenotoxic compounds correctly (83%) and 26 of 33 genotoxic compounds correctly (79%) for the TSET. The overall prediction rate for the TSET was 82% correct. The model predicted 12 of 14 nongenotoxic compounds correctly (86%) and five of six genotoxic compounds correctly (83%) for the PSET. The overall prediction rate for the PSET was 85% correct. The PNN model is considered to be our best single-classifier model found to date. The overall performance of this model was comparable to that of the three-descriptor k-NN model, while maintaining balanced prediction rates for both nongenotoxic and genotoxic classes. Pairwise correlation coefficients (r values) among the three descriptors in the unbiased PNN model were -0.223 (V5C-10 and EMIN-1), 0.322 (EMIN-1 and CHAA-1), and 0.094 (V5C-10 and CHAA-1).

The PNN model contains two topological descriptors and one electronic descriptor. V5C-10 is the 5th order valence-weighted cluster connectivity (χ) index (66). This descriptor encodes fragments containing exactly five bonds connecting heavy atoms and one or two branch points. Like V6P-7 in the LDA model, this index is valence-weighted and the presence of heteroatoms will influence the value of the descriptor in a similar fashion. In general, molecular structures containing more branches and more nonsecond principal quantum level heteroatoms will have higher V5C-10 values. The molecules in the thiophene data set with the largest V5C-10 values are those containing relatively many rings (and thus relatively many branches) and thiophenes that are di-, tri-, and tetrasubstituted with nonsecond principal quantum level halogens. V5C-10 imposes shape and atom type requirements on a molecule in order for it to be deemed genotoxic. EMIN-1 represents the lowest electrotopological state (e-state) (43) in the molecule. E-state indices encode electronic accessibility within a molecule and are calculated for each heavy atom. Lower e-states are assigned to atoms with fewer valence electrons and more neighboring atoms (topologically further from the molecule’s periphery). The e-state of an atom can be further lowered by neighboring atoms that have a larger number of valence electrons and/or that are less buried in the molecule. In practice, the lowest EMIN-1 values are assigned to carbon atoms with one or more adjacent high-valence heteroatoms. In the thiophene data set, the lowest EMIN-1 values correspond to halogenated thiophenes and the highest EMIN-1 values are associated with hydrocarbon-substituted thiophenes. However, there was no linear correlation between the EMIN-1 and the IMAX values either with or without S9. It is also reasonable to expect such a descriptor to appear in a model of genotoxicity since peripheral heteroatoms are important determinants of aqueous solubility, lipophilicity, and other absorption-distribution-metabolism-excretion properties traditionally associated with toxic effects. CHAA-1 is an electronic H-bonding CPSA descriptor and is the sum of the H-bond acceptor atom charges. This descriptor is a measure of the overall H-bonding ability of the molecule.

Consensus Model. Each of the three basic models described above is a fair to good predictor of thiophene genotoxicity. Each model also predicts a somewhat different set of compounds correctly. A consensus model was therefore developed to determine whether superior predictions could be obtained as a result of combining the predictions of the three model types. In the consensus model, the class assigned to a compound was the class that was predicted most often for that compound by the three basic model types. The results of the “majority rules” consensus technique are presented in Table 8. The prediction rates for both the TSET and the PSET are indeed improved. These results represent our best training and prediction set results to date. The genotoxic and nongenotoxic classes are also predicted fairly consistently.
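The “majority rules” combination is simple enough to state in a few lines. The function name and the string labels below are illustrative choices, not taken from the original software.

```python
from collections import Counter


def majority_class(lda, knn, pnn):
    """Return the class predicted by at least two of the three models."""
    return Counter([lda, knn, pnn]).most_common(1)[0][0]
```

With three voters and two classes a strict majority always exists, so a compound is misclassified by the consensus only when at least two of the basic models misclassify it.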

Outlier Analysis. The PSET outliers for each classification model are presented in Table 9. Each of the fitness evaluators assigned classes in a different way. However, Table 1 shows that of 56 misclassified compounds, 17 were misclassified by more than one model. One PSET compound (83) was misclassified by both the LDA and the PNN models. Four of the nine PSET outliers (47, 52, 83, and 122) were assigned a true class based on an IMAX value that was within 0.1 units of the cutoff value of 1.5 for either the -S9 or the +S9 SOS Chromotest. These compounds could have been misclassified due to inaccuracy in the test itself. Five of the outliers (12, 52, 76, 89, and 117) registered nonzero interference values for the Chromotest, causing uncertainty in the measured results. However, it was noted that 77 of the 140 compounds tested registered a nonzero interference value. Two of the outliers (47 and 83) had nonzero toxicity measurements, indicating that these compounds caused a change in the number of cells in the genotoxicity screen. In this study, 37 of the 140 compounds had nonzero alternative toxicity values. There were no unique functional groups among the outliers, and they were structurally similar to other compounds in the TSET. Outliers 107 and 117 differ only in the oxidation state of the side chain and were both false positives.

Table 6. Descriptors Included in the Selected GSA-PNN Model

| descriptor (b) | type (a) | σ | range: nontoxic | range: toxic |
|---|---|---|---|---|
| V5C-10 | T | 0.0072 | 0.000 to 0.563 | 0.000 to 0.742 |
| EMIN-1 | T | 0.0360 | -4.849 to 1.188 | -1.512 to 0.748 |
| CHAA-1 | E | 0.0998 | -2.466 to 1.065 | -0.613 to 0.490 |

(a) T, topological; E, electronic. (b) V5C-10, fifth order valence-weighted cluster χ index; EMIN-1, lowest electrotopological state index; CHAA-1, sum of H-bond acceptor atom charges.

Table 7. Confusion Matrices for the Selected GSA-PNN Model

| actual class | predicted nontoxic | predicted toxic | % correct |
|---|---|---|---|
| training set | | | 81.67 |
| nontoxic | 72 | 15 | 82.76 |
| toxic | 7 | 26 | 78.79 |
| prediction set | | | 85.00 |
| nontoxic | 12 | 2 | 85.71 |
| toxic | 1 | 5 | 83.33 |

Table 8. Confusion Matrices for the Consensus Model

| actual class | predicted nontoxic | predicted toxic | % correct |
|---|---|---|---|
| training set | | | 86.67 |
| nontoxic | 75 | 12 | 86.21 |
| toxic | 4 | 29 | 87.88 |
| prediction set | | | 95.00 |
| nontoxic | 14 | 0 | 100.00 |
| toxic | 1 | 5 | 83.33 |

Randomization Experiments. To show that the results obtained for the LDA, k-NN, PNN, and consensus models were not likely found due to chance correlations, randomizing experiments were performed (67). The first part of the experiment involved randomly scrambling the TSET dependent variable, in this case, the genotoxic class of the compounds. The second part of the experiment was an attempt to construct LDA, k-NN, and PNN models using the same methodology as was used to build the actual models but using the scrambled dependent variable data. For each model type, the dependent variable was scrambled 100 times. Because the same 100 scrambled dependent variables were used consistently for all three basic model types, consensus models were also obtained. In each case, the optimal scramble run model was chosen from the short list of best models provided by the feature selection routines based on the overall PSET classification rate. The results of these scramble runs are presented in Figure 2 as box and whisker plots. The box delineates the 25th and 75th percentiles, with a solid line representing the median inside the box. The whisker caps represent the 10th and 90th percentiles, and the solid circles represent the observations outside the 10th and 90th percentiles.
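The scrambling step itself can be sketched in a few lines. The function name and the fixed seed are assumptions; generating the permutations once and reusing them for every model type mirrors the statement that the same 100 scrambled dependent variables were used for all three classifiers.

```python
import random


def scramble_labels(labels, n_runs=100, seed=0):
    """Return n_runs random permutations of the TSET class labels."""
    rng = random.Random(seed)
    runs = []
    for _ in range(n_runs):
        shuffled = list(labels)
        rng.shuffle(shuffled)     # class counts are preserved, assignments are not
        runs.append(shuffled)
    return runs
```

Each scrambled label vector keeps the 87:33 class ratio of the TSET, so any model built on it can only reflect chance structure and class populations.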

On the basis of the populations of genotoxic and nongenotoxic compounds in the training and prediction sets, random classification should result in about 59% of the PSET molecules being correctly classified. This can be thought of in terms of rolling a 120-sided die with 87 of the faces labeled N (nongenotoxic) and the other 33 faces labeled G (genotoxic), since the TSET contains 87 nongenotoxic and 33 genotoxic compounds. The PSET contains 14 nongenotoxic and six genotoxic compounds. The probability of rolling an N is 87/120, and the probability that a PSET member actually is nongenotoxic is 14/20. Thus, the probability that a PSET compound is nongenotoxic and is correctly classified is (87/120) × (14/20). A similar argument can be made for the genotoxic class, resulting in an overall correct classification rate of (87/120) × (14/20) + (33/120) × (6/20) = 59%. This is shown graphically in Figure 2 as a long-dashed line. In addition, an overall classification rate of 87/120 = 73% for the TSET and 14/20 = 70% for the PSET could be achieved by simply guessing that every compound is nongenotoxic. This nonmodel makes no attempt at discrimination, however, and would be totally ineffective at screening toxic compounds. This classification rate is shown as a short-dashed line in Figure 2.
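The two baseline rates can be checked with exact fractions. The helper names below are illustrative; the counts are those given in the text.

```python
from fractions import Fraction


def random_rate(tset_counts, pset_counts):
    """Expected fraction correct when classes are drawn at TSET frequencies."""
    n_tset = sum(tset_counts.values())
    n_pset = sum(pset_counts.values())
    return sum(Fraction(tset_counts[c], n_tset) * Fraction(pset_counts[c], n_pset)
               for c in tset_counts)


def majority_rate(counts):
    """Fraction correct when every compound is assigned the larger class."""
    return Fraction(max(counts.values()), sum(counts.values()))


tset = {"N": 87, "G": 33}
pset = {"N": 14, "G": 6}
baseline = random_rate(tset, pset)        # (87/120)(14/20) + (33/120)(6/20)
```

Here `baseline` evaluates to exactly 59/100, and `majority_rate` gives 29/40 (72.5%, reported as 73%) for the TSET and 7/10 for the PSET.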

Table 9. Prediction Set Outliers

| ID | name | CAS number | type (a) | model(s) |
|---|---|---|---|---|
| 12 | 2-amino-4-chlorobenzothiazole | 19952-47-7 | FP | LDA |
| 47 | 4-bromo-2-thiophenecarboxaldehyde | 18791-75-8 | FP | LDA |
| 52 | 5-ethyl-2-thiophenecarboxaldehyde | 36880-33-8 | FN | k-NN |
| 76 | 5,5′-dibromo-2,2′-bithiophene | 4805-22-5 | FP | LDA |
| 83 | 3-(thianaphthen-3-yl)-L-alanine | 308103-39-1 | FN | LDA, PNN |
| 89 | nocodazole | 31430-18-9 | FP | PNN |
| 107 | 2-thiopheneethylamine | 30433-91-1 | FP | PNN |
| 117 | trans-2-(2-nitrovinyl)thiophene | 34312-77-1 | FP | k-NN |
| 122 | ethyl 2-thiophenecarboxylate | 2810-04-0 | FN | k-NN |

(a) FP, false positive; FN, false negative.

Figure 2. Comparison of classification rates of the best models found (hollow circles) with the scrambled models (box and whisker plots). For the scrambled models, the boxes enclose the median and 25th and 75th percentiles. The whiskers indicate the 10th and 90th percentiles. The long-dashed line represents the expected PSET classification rate (59%) if the PSET members’ classes were assigned randomly. The short-dashed line represents the PSET classification rate (70%) if every compound was classified as nongenotoxic. Model codes: L, GA-LDA; K, GA-k-NN; P, GSA-PNN; C, consensus. Set codes: T, TSET; P, PSET.

In examining Figure 2, several important observations may be made. First, the TSET prediction rates were generally higher than those of the PSET. This behavior was expected and reflected the model building routines’ tendency to overtrain using the data in the reduced pool. Note that the median PSET classification rates for the LDA and PNN type models are close to the random classification rate of 59%. The median PSET classification rate for the k-NN model (as well as the 25th percentile) is 70%, consistent with the observation that the k-NN classifier generally favors the more populous class and will get a higher classification rate by doing so. In general, the best scrambled k-NN models reported by the GA were significantly more skewed in favor of the nongenotoxic class for both TSET and PSET than the actual (nonscrambled) model. A second observation is that for the PSET in all three basic model types, at least one scrambled model (LDA = one model, k-NN = six models, PNN = one model) was able to achieve an overall classification rate equal to that of the actual models. However, in none of these scrambled models were the classification rates for the genotoxic compounds for both the TSET and the PSET greater than 50%. Scramble models with overall TSET classification rates equaling or exceeding those of the actual models were also found (LDA = two models, k-NN = one model, PNN = two models). In only one of these scrambled models were the classification rates for the genotoxic compounds for both the TSET and the PSET greater than 33%. The one best-performing scramble model was a PNN model that was able to classify 85% of the nongenotoxic and 79% of the genotoxic compounds correctly in the TSET (83% overall) and 79% of the nongenotoxic and 67% of the genotoxic compounds correctly in the PSET (75% overall). Like the actual models, the scramble run TSETs also seemed to benefit somewhat from the consensus approach. However, unlike the actual models, the PSETs did not produce results that were superior to the individual models.

There are legitimate reasons why the performance of the scramble runs could be close to those of the actual models. Methods based on linear regression and discrimination are sensitive to the number of descriptors in the reduced pool. According to Topliss and Edwards (68), chance correlations can arise when the number of descriptors in the reduced pool is large as compared to the number of compounds in the training set. Such is the case for LDA in this study. Additionally, models of any type with large numbers of descriptors are prone to the so-called “curse of dimensionality” (69), in which the volume of the descriptor space increases exponentially and the number of observations required to “cover” this volume of descriptor space increases correspondingly. Thus, an overfitted model may result when the number of observations is small as compared to the number of descriptors. This may also be the case for the LDA model. However, on the basis of the results of the scramble experiments, it can be said that the actual models probably did not come about due to chance effects.

Eleven Compound External Validation Set. Eleven additional compounds were selected for an external validation set designed to test each of the models presented. These were screened, and the structures were then assigned a predicted genotoxicity using each of the models presented. These results are shown in Table 10. The corresponding confusion matrices are shown in Tables 11-14. The LDA model was able to correctly classify five of eight nongenotoxic compounds (63%) and two of three genotoxic compounds (67%). The k-NN model was the best performer for the external validation set and correctly predicted seven of eight nongenotoxic compounds (88%) and two of three genotoxic compounds (67%). The PNN model correctly classified five of five nongenotoxic compounds (100%) and zero of two genotoxic compounds (0%). The PNN model’s output activations were not large enough to make a confident prediction for four of the compounds, indicated by a “#” symbol in Table 10. This is due to the fact that the predicted compound is sufficiently far away from any training set member in descriptor space, preventing the PNN from providing a confident prediction. The most notable example of this is the nongenotoxic compound 145, tetrabromothiophene. This compound has a V5C-10 value that is twice as large as the largest V5C-10 values in the TSET, which happen to belong to two very similar compounds, 2,3-dibromothiophene (128) and 2,3,5-tribromothiophene (110), which are both genotoxic. Tetrabromothiophene was also misclassified by the LDA and k-NN models. The consensus model gave mediocre results, due primarily to the fact that the PNN model was unable to resolve the differences between the LDA and the k-NN model predictions for two of the compounds (147 and 149), as indicated by a “?” symbol in Table 10. The results from the external validation set as a whole suggest that the performance of the models suffers somewhat when they are applied to external data sets.
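The consensus outcomes for the external validation set, including the undecided cases, can be reproduced from the individual predictions in Table 10. In the sketch below, the function name and the encoding of a PNN abstention ("#") as `None` are assumptions; the tuples are transcribed from Table 10, with the measured class taken as "+" when either the +S9 or the -S9 test was positive.

```python
from collections import Counter


def consensus_with_abstention(votes):
    """Majority of the confident votes; '?' when the remaining votes are tied."""
    counts = Counter(v for v in votes if v is not None)
    if counts["+"] == counts["-"]:
        return "?"
    return "+" if counts["+"] > counts["-"] else "-"


# (ID, measured class, LDA, k-NN, PNN); None marks a PNN abstention ("#").
TABLE_10 = [
    (141, "-", "-", "-", "-"), (142, "-", "+", "-", "-"),
    (143, "-", "-", "-", "-"), (144, "-", "-", "-", None),
    (145, "-", "+", "+", None), (146, "-", "-", "-", "-"),
    (147, "-", "+", "-", None), (148, "+", "+", "-", "-"),
    (149, "+", "-", "+", None), (150, "+", "+", "+", "-"),
    (151, "-", "-", "-", "-"),
]

results = {cid: consensus_with_abstention([lda, knn, pnn])
           for cid, _, lda, knn, pnn in TABLE_10}
decided = [(cid, measured) for cid, measured, *_ in TABLE_10
           if results[cid] != "?"]
n_correct = sum(1 for cid, measured in decided if results[cid] == measured)
```

Compounds 147 and 149 come out undecided, and the remaining nine compounds give seven correct consensus predictions, in line with Table 14.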

Table 10. Eleven Thiophene Derivatives Used as an External Validation Set; Incorrect Predictions Are Shown in Parentheses

| ID | set (a) | name | CAS number | measured (b): +S9 | measured (b): -S9 | calculated (c): L | calculated (c): K | calculated (c): P | calculated (c): C |
|---|---|---|---|---|---|---|---|---|---|
| 141 | X | 5-bromo-2,2′-bithiophene | 3480-11-3 | - | - | - | - | - | - |
| 142 | X | 3-(2-thienyl)acrylic acid | 15690-25-2 | - | - | (+) | - | - | - |
| 143 | X | 2-ethylthiophene | 872-55-9 | - | - | - | - | - | - |
| 144 | X | methyl thiophene-2-acetate | 19432-68-9 | - | - | - | - | # | - |
| 145 | X | tetrabromothiophene | 3958-03-0 | - | - | (+) | (+) | # | (+) |
| 146 | X | 2-(2-hydroxy-phenoxy)-1-thiophen-2-yl-ethanone | 106662-95-7 | - | - | - | - | - | - |
| 147 | X | 1-(5-ethyl-thiophen-2-yl)ethanone | R433292 (Sigma-Aldrich) | - | - | (+) | - | # | ? |
| 148 | X | 2-methylbenzothiazole | 120-75-2 | + | + | + | (-) | (-) | (-) |
| 149 | X | 6-methoxy-2-methylbenzothiazole | 2941-72-2 | - | + | (-) | + | # | ? |
| 150 | X | 2-amino-4,5,6,7-tetrahydrobenzo-(B)-thiophene-3-carbonitrile | 4651-91-6 | + | - | + | + | (-) | + |
| 151 | X | 3-bromo-4-oxo-4-thiophen-2-yl-butyric acid | 53515-21-2 | - | - | - | - | - | - |

(a) X, external validation set member. (b) Measured genetic toxicity: +S9, with S9; -S9, without S9. (c) Calculated genetic toxicity: L, LDA; K, k-NN; P, PNN; C, consensus.
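The consensus column of Table 10 is consistent with a simple majority vote in which a PNN result of "#" counts as an abstention and a tie between the remaining votes is reported as "?". The sketch below assumes that rule; it is inferred from the table entries and the discussion of compounds 147 and 149, not quoted from the authors' procedure. The function name `consensus` is hypothetical.

```python
def consensus(lda, knn, pnn):
    """Majority vote over '+'/'-' calls; '#' (no confident PNN call) abstains.

    Returns '?' when the remaining votes are tied.
    """
    votes = [v for v in (lda, knn, pnn) if v in ("+", "-")]
    plus = votes.count("+")
    minus = votes.count("-")
    if plus == minus:
        return "?"
    return "+" if plus > minus else "-"

# L, K, P calls from Table 10 (parentheses only mark errors and are dropped).
table10_calls = {
    141: ("-", "-", "-"),
    142: ("+", "-", "-"),
    143: ("-", "-", "-"),
    144: ("-", "-", "#"),
    145: ("+", "+", "#"),
    146: ("-", "-", "-"),
    147: ("+", "-", "#"),
    148: ("+", "-", "-"),
    149: ("-", "+", "#"),
    150: ("+", "+", "-"),
    151: ("-", "-", "-"),
}

# Consensus column C of Table 10, parentheses dropped.
table10_consensus = {
    141: "-", 142: "-", 143: "-", 144: "-", 145: "+", 146: "-",
    147: "?", 148: "-", 149: "?", 150: "+", 151: "-",
}

computed = {cid: consensus(*calls) for cid, calls in table10_calls.items()}
```

Under this assumed rule, the two "?" entries arise exactly where LDA and k-NN disagree and the PNN abstains, which matches the explanation given for compounds 147 and 149.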

Table 11. LDA Model Confusion Matrix for the External Validation Set

| actual class | predicted nontoxic | predicted toxic | % correct |
|---|---|---|---|
| training set | | | 63.64 |
| nontoxic | 5 | 3 | 62.50 |
| toxic | 1 | 2 | 66.67 |

Table 12. k-NN Model Confusion Matrix for the External Validation Set

| actual class | predicted nontoxic | predicted toxic | % correct |
|---|---|---|---|
| training set | | | 81.82 |
| nontoxic | 7 | 1 | 87.50 |
| toxic | 1 | 2 | 66.67 |

Table 13. PNN Model Confusion Matrix for the External Validation Set

| actual class | predicted nontoxic | predicted toxic | % correct |
|---|---|---|---|
| training set | | | 71.43 |
| nontoxic | 5 | 0 | 100.00 |
| toxic | 2 | 0 | 0.00 |

Table 14. Consensus Model Confusion Matrix for the External Validation Set

| actual class | predicted nontoxic | predicted toxic | % correct |
|---|---|---|---|
| training set | | | 77.78 |
| nontoxic | 6 | 1 | 85.71 |
| toxic | 1 | 1 | 50.00 |
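The counts in Tables 11-14 can be recomputed directly from the calls in Table 10. The sketch below does so, assuming that "#" and "?" entries are left out of the matrices (which is what the row totals of Tables 13 and 14 imply). It also suggests that the first row of each table, labeled "training set" in the source, numerically equals the overall percentage correct over the validation compounds that received a definite call. The variable and function names are illustrative only.

```python
# One character per compound, in ID order 141..151 (from Table 10).
MEASURED = "-------+++-"  # '+' if positive with or without S9
CALLS = {
    "LDA": "-+--+-++-+-",
    "k-NN": "----+---++-",
    "PNN": "---##-#-#--",
    "consensus": "----+-?-?+-",
}

def confusion(measured, calls):
    """Count (actual, predicted) pairs, skipping '#' and '?' entries."""
    matrix = {("-", "-"): 0, ("-", "+"): 0, ("+", "-"): 0, ("+", "+"): 0}
    for actual, call in zip(measured, calls):
        if call in ("+", "-"):
            matrix[(actual, call)] += 1
    return matrix

def percent_correct(matrix):
    """Overall percentage of definite calls that match the measured class."""
    total = sum(matrix.values())
    correct = matrix[("-", "-")] + matrix[("+", "+")]
    return round(100.0 * correct / total, 2)

results = {name: confusion(MEASURED, calls) for name, calls in CALLS.items()}
overall = {name: percent_correct(matrix) for name, matrix in results.items()}

for name, matrix in results.items():
    print(name, matrix, overall[name])
```

Because the PNN and consensus models make definite calls for only seven and nine of the eleven compounds, their percentages are computed over smaller denominators than those of LDA and k-NN and are not directly comparable.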

730 Chem. Res. Toxicol., Vol. 16, No. 6, 2003 Mosier et al.


However, it should be noted that although the compounds in the validation set were similar to those used when building the models, this validation set was very small, and as such, the models should not be discounted based solely on the results of this small test.

General Considerations. It should be noted that the models presented here do not necessarily represent cause and effect relationships. They are simply correlations between an observed genotoxicity indicator and a set of descriptors, which in some cases make intuitive chemical sense but in other cases are defined in an abstract way. What we know from experience is that when statistically sound, these descriptors and the models associated with them have been shown to be effective predictors of the property being modeled. Obviously, a descriptor subset that makes intuitive chemical sense is an added bonus. However, the true cause and effect relationships at work may be obscured for a number of reasons. These may include but are not limited to the following. In the objective feature selection phase, descriptors are removed from consideration when the pairwise correlation coefficients are sufficiently high. As a result, an important or interpretable descriptor may be inadvertently removed early in the model development process. To complicate things, intercorrelation among the descriptors in the final model can obscure a simpler explanation consisting of fewer and/or more interpretable descriptors. There may be insufficient information encoded in the calculated pool of descriptors to adequately model the observed genotoxicity. Finally, superior models may exist, but they may not have been found by the subjective feature selection routines. Even when considering these complications, the models presented here should be useful to toxicologists interested in predicting the genotoxic potential of thiophene-based organic compounds.
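The first complication, losing an interpretable descriptor to a pairwise correlation filter, can be made concrete with a small sketch. The greedy filter below is an assumed, generic form of such a filter, not the routine used by the authors; the cutoff `r_max`, the descriptor names, and the values are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)  # undefined for constant descriptors

def drop_correlated(descriptors, r_max=0.9):
    """Greedy filter: keep a descriptor only if |r| <= r_max with all kept ones."""
    kept = []
    for name, values in descriptors.items():
        if all(abs(pearson(values, descriptors[k])) <= r_max for k in kept):
            kept.append(name)
    return kept

# Made-up descriptor values for five hypothetical compounds.
toy = {
    "desc_A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "desc_B": [2.1, 3.9, 6.2, 8.0, 9.9],  # nearly 2 * desc_A
    "desc_C": [5.0, 1.0, 4.0, 2.0, 3.0],
}

kept = drop_correlated(toy)
kept_reversed = drop_correlated(dict(reversed(list(toy.items()))))
```

Which member of a correlated pair survives depends only on the order in which descriptors are visited: here `desc_B` is dropped in one ordering and `desc_A` in the other, regardless of which of the two is easier to interpret chemically.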

Conclusions

Several predictive binary classification models have been presented that directly link the genetic toxicity of a series of 140 thiophene derivatives with information derived from the compounds’ molecular structure. Genetic toxicity was measured using an SOS Chromotest. IMAX (maximal SOS induction factor) values were recorded for each of the 140 compounds both in the presence and in the absence of S9 rat liver homogenate. Compounds were classified as genotoxic if IMAX ≥ 1.5 in either test or nongenotoxic if IMAX < 1.5 for both tests. The molecular structures were represented by numerical descriptors that encoded the topological, geometric, electronic, and polar surface area properties of the thiophene derivatives. The classification models used were LDA, k-NN, and PNN. These were used in conjunction with either a GA or a GSA to find optimal subsets of descriptors for each classifier. The quality of the resulting models was determined by the number of misclassified compounds, with preference given to models that produced fewer false negative classifications. Model sizes ranged from seven descriptors for LDA to three descriptors for k-NN and PNN. Very good classification results were obtained with all three classifiers. Classification rates for the LDA, k-NN, and PNN models were 80, 85, and 85%, respectively, for the prediction set compounds. Additionally, a consensus model was generated that incorporated all three of the basic model types. This consensus model correctly predicted the genotoxicity of 95% of the prediction set compounds.
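The class assignment rule restated in the Conclusions (genotoxic if IMAX is at least 1.5 with or without S9, nongenotoxic only if IMAX is below 1.5 in both tests) amounts to a logical OR over the two assays. A minimal sketch follows; the function name and the example IMAX values are hypothetical and are not measurements from the paper.

```python
IMAX_CUTOFF = 1.5  # threshold on the maximal SOS induction factor

def classify(imax_with_s9, imax_without_s9, cutoff=IMAX_CUTOFF):
    """Label a compound from its two SOS Chromotest IMAX values."""
    if imax_with_s9 >= cutoff or imax_without_s9 >= cutoff:
        return "genotoxic"
    return "nongenotoxic"

# Hypothetical IMAX pairs (+S9, -S9) used only to exercise the rule.
examples = {
    "positive only with S9": classify(2.3, 1.0),
    "positive only without S9": classify(1.1, 1.5),
    "negative in both tests": classify(1.49, 1.2),
}
```

Because a single positive assay is enough to label a compound genotoxic, the rule errs toward flagging compounds, in line with the stated preference for models that produce fewer false negatives.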

Acknowledgment. We acknowledge Sondra Livingston-Carr, Oneal Puri, and Jennifer Price.



