Date posted: 20-Dec-2015
Discovering Substructures in Chemical Toxicity Domain
Masters Project Defense
by
Ravindra Nath Chittimoori
Committee: Dr. Lawrence B. Holder, Dr. Diane J. Cook, Dr. Lynn Peterson
Department of Computer Science and Engineering
University of Texas at Arlington

Outline
Chemical Toxicity Database
Motivation and Goal
Knowledge Discovery in Databases (KDD)
SUBDUE Knowledge Discovery System
Experiments with Unsupervised SUBDUE
Experiments with Supervised SUBDUE
Discussion of Results
Conclusions
Future Work

Chemical Toxicity Database
Carcinogenesis Prediction Problem
Toxicology Evaluation Challenge
Domain:
Compounds            +      -    Total
Training set       162    136      298
Experimental set    27     25       69

Motivation and Goal
Ever-increasing number of chemical compounds
Analysis is needed to obtain the structure-activity relationships of a compound
Determine SUBDUE’s applicability to the chemical toxicity domain

Knowledge Discovery in Databases (KDD)
Process of identifying valid, novel, potentially useful and understandable patterns in data
Goal of Knowledge Discovery:
-- Verification
-- Discovery
Data mining methods
Model Representation, Evaluation and Search

Steps in KDD
-- Identify the goal of the process
-- Collect, create and prepare the dataset
-- Select the data mining method
-- Select the data mining algorithm
-- Transform the data
-- Execute the algorithm
-- Interpret/evaluate the discovered patterns
-- Consolidate the knowledge discovered

SUBDUE Knowledge Discovery System
SUBDUE discovers patterns [substructures] in structural data sets
Vertices: objects or attributes
Edges: relationships
[Figure: example graph whose object vertices have shape edges to triangle and square, with a triangle object on a square object; the figure marks 4 instances of this substructure]

SUBDUE - Input Representation
Each atom is represented as a vertex with directed edges to the name, type and partial charge of the atom
Bonds are represented as undirected edges
Each group is represented as a vertex having a string label specifying the group name, with directed edges to all participating atom vertices
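
The representation above can be sketched in a few lines of Python. This is an illustration only, not SUBDUE's actual input format: the `Graph` class and the helper names are hypothetical, the edge labels n, t, p and gr follow the legend used in the example figures, and the atom values and the bond label 1 are taken from those figures.

```python
from dataclasses import dataclass, field


@dataclass
class Graph:
    """Tiny labeled graph (hypothetical helper, not SUBDUE's file format)."""
    vertices: dict = field(default_factory=dict)  # vertex id -> label
    edges: list = field(default_factory=list)     # (src, dst, label, directed)

    def add_vertex(self, label):
        vid = len(self.vertices) + 1
        self.vertices[vid] = label
        return vid

    def add_edge(self, src, dst, label, directed=True):
        self.edges.append((src, dst, label, directed))


def add_atom(g, name, atom_type, charge):
    """Atom vertex with directed edges to its name, type and partial charge."""
    atom = g.add_vertex("atom")
    g.add_edge(atom, g.add_vertex(name), "n")
    g.add_edge(atom, g.add_vertex(atom_type), "t")
    g.add_edge(atom, g.add_vertex(charge), "p")
    return atom


def add_bond(g, a, b, bond_type):
    """Bond between two atom vertices, stored as an undirected edge."""
    g.add_edge(a, b, bond_type, directed=False)


def add_group(g, group_name, atoms):
    """Group vertex with a directed edge to every participating atom."""
    group = g.add_vertex(group_name)
    for atom in atoms:
        g.add_edge(group, atom, "gr")
    return group


g = Graph()
c1 = add_atom(g, "c", "10", "0.062")
c2 = add_atom(g, "c", "10", "0.063")
add_bond(g, c1, c2, "1")
add_group(g, "methyl", [c1, c2])
print(len(g.vertices), "vertices,", len(g.edges), "edges")
```

Storing name, type and charge as separate vertices rather than as one composite label lets a substructure match on any subset of them, for example atoms of a given type regardless of charge.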

SUBDUE - Input Representation
Representation used in Unsupervised SUBDUE
-- A vertex having a string label specifying the alert, with directed edges to all the atoms in the compound
Representation used in Supervised SUBDUE
-- Every compound has a vertex with the string label compound
-- The compound vertex has directed edges to all the vertices representing the activity of an alert on the compound

Unsupervised SUBDUE Input Representation Example
[Figure: two Atom vertices, each with n, t and p edges (name C, type 10, partial charges 0.062 and 0.063), joined by a bond labeled 1; a Methyl vertex with gr edges to the atoms; an Ames vertex with po edges to the atoms]
n - Name
t - Type
p - Partial charge
po - Positive
gr - group

Supervised SUBDUE Input Representation Example
[Figure: two Atom vertices, each with n, t and p edges (name C, type 10, partial charges 0.062 and 0.063), joined by a bond labeled 1; a Methyl vertex with gr edges to the atoms; a Com vertex with contains edges to the atoms and an Ames edge to a Positive vertex]
n - Name
t - Type
p - Partial charge
gr - group
Com - Compound

SUBDUE - Model Evaluation
Minimum Description Length Principle
-- The best theory to describe a graph is the one that minimizes its description length
-- Minimize I(S) + I(G|S), where I(S) is the number of bits needed to encode substructure S and I(G|S) is the number of bits needed to encode graph G once it has been compressed using S
Graph Compression
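
As a rough illustration of why a frequently occurring substructure compresses a graph well, the sketch below replaces the bit-level description length with a much cruder proxy: the number of vertices plus the number of edges. The function names and the proxy are assumptions made for this example; SUBDUE's MDL measure counts encoding bits and is more involved.

```python
def size(num_vertices, num_edges):
    """Crude stand-in for description length: vertices plus edges."""
    return num_vertices + num_edges


def compression_value(graph, sub, instances):
    """Ratio size(G) / (size(S) + size(G|S)); higher means better compression.

    graph, sub: (num_vertices, num_edges) tuples.
    instances: number of non-overlapping instances of sub found in graph.
    """
    gv, ge = graph
    sv, se = sub
    # Each instance collapses to a single vertex and its internal edges vanish.
    compressed_v = gv - instances * (sv - 1)
    compressed_e = ge - instances * se
    return size(gv, ge) / (size(sv, se) + size(compressed_v, compressed_e))


# A 20-vertex, 24-edge graph containing 4 copies of a 4-vertex, 3-edge pattern.
print(round(compression_value((20, 24), (4, 3), 4), 3))
# The same pattern with no instances does not pay for its own description.
print(round(compression_value((20, 24), (4, 3), 0), 3))
```

A value above 1 means that describing the substructure once and then the compressed graph is cheaper than describing the original graph directly.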

Other Important Concepts of SUBDUE
Inexact Graph Match Approach
Concept Learning
Predefined Substructures

Unsupervised SUBDUE - Methodology
Training set further divided
3 approaches to determine carcinogenicity of compounds in the experimental set
-- Apply SUBDUE individually to the compounds
-- Inclusion of pre-defined substructures
-- Check for matching of substructure in the compound to be classified
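
The third approach amounts to asking whether a discovered substructure occurs in the graph of the compound to be classified. A minimal sketch of such a check is given below, assuming exact label matching and a simple backtracking search; SUBDUE itself also supports inexact matching, which this example does not attempt. The function name and the graph encoding (dictionaries of vertex labels and sets of directed edges) are hypothetical.

```python
def contains_substructure(graph_v, graph_e, sub_v, sub_e):
    """Return True if the labeled substructure occurs in the labeled graph.

    graph_v, sub_v: dict mapping vertex id -> label.
    graph_e, sub_e: set of (src, dst, label) directed edges; an undirected
    edge is represented by adding both directions.
    """
    sub_ids = list(sub_v)

    def extend(mapping):
        if len(mapping) == len(sub_ids):
            return True
        s = sub_ids[len(mapping)]
        for g in graph_v:
            if g in mapping.values() or graph_v[g] != sub_v[s]:
                continue
            mapping[s] = g
            # Every substructure edge between mapped vertices must exist.
            ok = all(
                (mapping[a], mapping[b], lab) in graph_e
                for (a, b, lab) in sub_e
                if a in mapping and b in mapping
            )
            if ok and extend(mapping):
                return True
            del mapping[s]
        return False

    return extend({})


# Compound fragment: a carbon atom bonded to a bromine atom.
compound_v = {1: "atom", 2: "c", 3: "atom", 4: "br"}
compound_e = {(1, 2, "n"), (3, 4, "n"), (1, 3, "1"), (3, 1, "1")}
# Substructure: any atom whose name is br.
sub_v = {"x": "atom", "y": "br"}
sub_e = {("x", "y", "n")}
print(contains_substructure(compound_v, compound_e, sub_v, sub_e))
```

The search is exponential in the worst case, which is acceptable for substructures of a handful of vertices but would need pruning for larger patterns.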

Unsupervised SUBDUE - Results
[Figure: discovered substructure with two atom vertices carrying n, t and p edges; one has name c, type 10 and partial charge 0.062, the other has name br and partial charge 0.057; the labels 1 and 3 also appear in the figure]
Third approach used to classify compounds in the experimental set
Accuracy Level -> 0.322
Cyanate & ether groups are also discovered to be indicators of carcinogenic activity

Supervised SUBDUE - Methodology
Create set of indicators of carcinogenic activity
Create set of indicators of noncarcinogenic activity
Calculate value of substructures discovered in the carcinogenic and noncarcinogenic sets
Select a set of substructures to be used in classifying compounds in the experimental set

Supervised SUBDUE - Methodology
Check for the existence of these substructures in the compound to be classified
Calculate the Carcinogenic Activity Value of the compound
Calculate the NonCarcinogenic Activity Value of the compound
Determine the activity of the compound
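
The slides do not state how the two activity values are computed, so the sketch below makes an explicit assumption: each activity value is the sum of the values of the selected substructures found in the compound, and the compound is predicted positive only when the carcinogenic value is strictly larger. The function names, the tie-breaking rule and the numeric values are all hypothetical; only the group names come from the results reported in this presentation.

```python
def activity_value(present, weighted_substructures):
    """Sum the values of the selected substructures found in the compound."""
    return sum(value for name, value in weighted_substructures.items()
               if name in present)


def classify(present, carcinogenic, noncarcinogenic):
    """Compare the two activity values and return '+' or '-'."""
    pos = activity_value(present, carcinogenic)
    neg = activity_value(present, noncarcinogenic)
    # Assumption: ties are resolved as noncarcinogenic.
    return "+" if pos > neg else "-"


# Hypothetical values for a few of the discovered indicators.
carcinogenic = {"amino": 0.8, "halide10": 0.5, "ames_positive": 0.9}
noncarcinogenic = {"nitro": 0.3, "methoxy": 0.4, "ames_negative": 0.7}

# Substructures found in a compound to be classified.
print(classify({"amino", "nitro"}, carcinogenic, noncarcinogenic))
print(classify({"methoxy", "ames_negative"}, carcinogenic, noncarcinogenic))
```

Other scoring rules, such as counting matches or weighting by compression value, fit the same four steps; the comparison in `classify` is the only part that would change.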

Supervised SUBDUE - Results
A set of 12 substructures discovered by SUBDUE was used to classify compounds in the experimental set
6 substructures from the carcinogenic set: substructures which form part of groups like amino, di10, methyl, ether and halide10, and a substructure which indicates a compound testing positive on Ames, Salmonella, etc.
6 substructures from the noncarcinogenic set: substructures which form part of groups like methoxy, Ar_Halide, di64, nitro and alkyl_halide, and a substructure which indicates a compound testing negative on Ames, Salmonella, etc.

Supervised SUBDUE - Substructure Example - Carcinogenic Set
[Figure: a Compound vertex with Ames, Salmonella and Salmonella_n edges, each leading to a positive vertex]

Supervised SUBDUE - Substructure Example - Carcinogenic Set
[Figure: two Atom vertices carrying n, t and p edges, with gr edges from a Halide10 group vertex; labels shown: Cl, C, 10, 93, -0.024, -0.123]
n - Name
t - Type
p - Partial charge
gr - group

Supervised SUBDUE - Substructure Example - NonCarcinogenic Set
[Figure: a Compound vertex with Ames, Salmonella and Cytogen_ca edges, each leading to a negative vertex]

Supervised SUBDUE - Substructure Example - NonCarcinogenic Set
[Figure: two Atom vertices carrying n, t and p edges, with gr edges from an A-H group vertex; labels shown: Cl, C, 10, 93, 0.477, -0.124]
n - Name
t - Type
p - Partial charge
gr - group
A-H - Alkyl Halide

Supervised SUBDUE - Results
PTE-1 Results:
Compounds               +      -    Total
PTE-1                  20     19       39
Correct Prediction     12      6       18
Incorrect Prediction    8     13       21
Accuracy: 0.6 (+), 0.316 (-), 0.462 (total)

Supervised SUBDUE - Results
PTE-2 Results:
Compounds               +      -    Total
PTE-2                   7      6       13 *
Correct Prediction      4      3        7
Incorrect Prediction    3      3        6
* : # of compounds whose activity is known
Accuracy: 0.571 (+), 0.5 (-), 0.538 (total)
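
The accuracy figures for both experimental sets follow directly from the counts in the two result tables. The short check below recomputes them as correct predictions divided by the number of compounds in each class; the `accuracy` helper and the `results` dictionary are names introduced only for this illustration.

```python
def accuracy(correct, total):
    """Fraction of compounds whose predicted activity matches the known one."""
    return correct / total


# (correct, total) pairs for positive and negative compounds, from the tables.
results = {
    "PTE-1": {"+": (12, 20), "-": (6, 19)},
    "PTE-2": {"+": (4, 7), "-": (3, 6)},
}

for name, by_class in results.items():
    pos_c, pos_n = by_class["+"]
    neg_c, neg_n = by_class["-"]
    overall = accuracy(pos_c + neg_c, pos_n + neg_n)
    print(name,
          round(accuracy(pos_c, pos_n), 3),
          round(accuracy(neg_c, neg_n), 3),
          round(overall, 3))
```

Reporting per-class accuracy alongside the total matters here because the two classes behave very differently on PTE-1.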

Results - Discussion
Unsupervised SUBDUE successful in discovering lead indicators of carcinogenic activity
Supervised SUBDUE also successful in discovering lead indicators of carcinogenic activity
ILP System PROGOL: PTE-1 (0.72), PTE-2 (0.62)
Ashby, TOPKAT are other toxicity prediction methods

Conclusions
Results are consistent with those obtained by logic-based systems like PROGOL
Prefer the Concept Learner when positive and negative examples of the target concept are available
SUBDUE is capable of discovering lead indicators of carcinogenic/noncarcinogenic activity in the chemical toxicity domain

Future Work
PTE-3 Evaluation Challenge
Trimmed Data Sets (Partial Charge)
Newer Version of Concept Learning SUBDUE being developed

Reference
http://cygnus.uta.edu/subdue