Molecular Similarity Searching Using Inference Network · 2013-04-09 · Molecular Similarity...


Molecular Similarity Searching Using Inference Network

Ammar Abdo, Naomie Salim* Faculty of Computer Science & Information Systems

Universiti Teknologi Malaysia

Molecular Similarity Searching

•  Search for chemical compounds with similar structure or properties to a known compound

•  A variety of methods used in these searches
   –  Graph theory
   –  1D, 2D and 3D shape similarity, docking similarity, electrostatic similarity and others
   –  Machine learning methods, e.g. BKD, SVM, NBC, NN

•  The vector space model using 2D fingerprints and Tanimoto coefficients is one of the most widely used molecular similarity measures
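The Tanimoto coefficient on binary 2D fingerprints can be sketched as follows. This is a minimal illustration, assuming each fingerprint is stored as a set of on-bit positions; the molecule names and bit positions are made up for the example.

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto coefficient between two binary fingerprints.

    Each fingerprint is a set of on-bit positions (an assumed representation).
    """
    common = len(fp_a & fp_b)
    union = len(fp_a) + len(fp_b) - common
    return common / union if union else 0.0


# Hypothetical query and database fingerprints.
query = {1, 4, 7, 9}
database = {
    "mol_a": {1, 4, 7, 9},
    "mol_b": {1, 4, 8},
    "mol_c": {2, 3},
}

# Rank database molecules by decreasing similarity to the query.
ranked = sorted(database, key=lambda name: tanimoto(query, database[name]), reverse=True)
```

Here `mol_b` shares two bits with the query out of five distinct bits overall, so its coefficient is 2/5.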

Rationale for Chemical Similarity

•  Similar property principle: structurally similar molecules are likely to have similar properties
•  Given a known active molecule, a similarity search can identify further molecules in the database for testing

Probabilistic models (Alternative approach)

•  Why probabilistic models
   –  Information retrieval deals with uncertain information
      •  Query and compound characterizations are incomplete
   –  Probability theory seems to be the most natural way to quantify uncertainty
   –  Applied in IR for text documents

Why Bayesian Networks

–  Bayesian nets are the most popular way of doing probabilistic inference in AI
–  Clear formalism to combine evidence
–  Modularize the world (dependencies)
–  Bayesian network models for IR
   •  Inference Network (Turtle & Croft, 1991)
   •  Belief Network (Ribeiro-Neto & Muntz, 1996)
–  Simple

Bayesian inference

•  Bayes' rule: the heart of Bayesian techniques

   P(H|E) = P(E|H) P(H) / P(E)

   where H is a hypothesis and E is the evidence:
   –  P(H): prior probability
   –  P(H|E): posterior probability
   –  P(E|H): probability of E if H is true
   –  P(E): a normalizing constant

   Because P(E) is a constant, we write: P(H|E) ∝ P(E|H) P(H)
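A small numeric illustration of Bayes' rule may help. The sketch below assumes a binary hypothesis, so that P(E) can be expanded over H and not-H; the function name and the probabilities are invented for the example.

```python
def posterior(prior_h: float, p_e_given_h: float, p_e_given_not_h: float) -> float:
    """Return P(H|E) for a binary hypothesis H using Bayes' rule."""
    # P(E) is the normalizing constant, expanded over H and not-H.
    p_e = p_e_given_h * prior_h + p_e_given_not_h * (1.0 - prior_h)
    return p_e_given_h * prior_h / p_e


# Hypothetical numbers: a rare hypothesis with fairly strong evidence.
p = posterior(prior_h=0.01, p_e_given_h=0.8, p_e_given_not_h=0.1)
```

With these made-up values the posterior is 0.008 / 0.107, so evidence that is eight times more likely under H raises a 1% prior to roughly 7.5%.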

Bayesian Networks

•  What is a Bayesian network?
   –  It is a directed acyclic graph (DAG) in which nodes represent random variables.
   –  The parents of a node are those judged to be direct causes of it.
   –  The roots of the network are the nodes without parents.
   –  The arcs represent causal relationships between these variables, and the strengths of these causal influences are expressed by conditional probabilities.

x1, ..., xn: parent nodes; X is the set of parents of y (in this case, root nodes)

y: child node; each xi causes y

The influence of X on y can be quantified by any function (conditional probabilities): F(y, X) = P(y|X)

[Figure: parent nodes x1, x2, ..., xn]

p(c|a,b) for all values of a, b, c

Conditional dependence

•  Running Bayesian nets:
   •  Given probability distributions for the roots and conditional probabilities of the nodes, we can compute the a priori probability of any instance
   •  Changes in parents (e.g., b was observed) will cause recomputation of probabilities
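The two points above can be illustrated with a three-node network in which roots a and b are the parents of c. This is a minimal sketch by brute-force enumeration; all probability values are invented for the example and are not taken from the slides.

```python
from itertools import product

# Hypothetical priors for the root nodes a and b.
p_a = {True: 0.3, False: 0.7}
p_b = {True: 0.6, False: 0.4}

# Hypothetical conditional probability table: P(c = true | a, b).
p_c_given = {
    (True, True): 0.9,
    (True, False): 0.5,
    (False, True): 0.4,
    (False, False): 0.1,
}


def joint(a: bool, b: bool, c: bool) -> float:
    """A priori probability of one complete instance (a, b, c)."""
    p_c = p_c_given[(a, b)]
    return p_a[a] * p_b[b] * (p_c if c else 1.0 - p_c)


def prob_c(observed_b=None) -> float:
    """P(c = true), optionally conditioned on an observed value of b."""
    numerator = denominator = 0.0
    for a, b in product([True, False], repeat=2):
        if observed_b is not None and b != observed_b:
            continue  # skip instances inconsistent with the evidence
        numerator += joint(a, b, True)
        denominator += joint(a, b, True) + joint(a, b, False)
    return numerator / denominator


before = prob_c()                 # no evidence
after = prob_c(observed_b=True)   # b was observed, so P(c) is recomputed
```

Enumeration is exponential in the number of variables, so it is only suitable for toy networks like this one.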

Bayesian networks

•  How to describe and compare molecules
   –  Network model generation
      •  Description of the system in a suitable network form
   –  Representation of importance of descriptors (weighting schemes)
   –  Probability estimation for the network model
   –  Calculate the similarity scores

Bayesian networks approach to molecular similarity searching

Bayesian inference network

•  Nodes
   –  compounds (cj)
   –  features (fi)
   –  queries (q1, q2, ..., qr)
   –  target (A)
•  Edges
   –  Edges from cj to its feature nodes fi indicate that the observation of cj increases the belief in the variables fi.

Definitions

•  fi, cj, and qk are random variables.
•  F = (f1, f2, ..., fn) is an n-dimensional vector (n equal to the fingerprint length)
•  fi ∈ {0, 1} for all i, so F has 2^n possible states
•  cj ∈ {0, 1} for all j; qk ∈ {0, 1} for all k
•  The rank of a compound cj is computed as P(q = true | cj = true)
•  (cj stands for a state where cj = true and ci = false for all i ≠ j, because we observe one compound at a time)

Directed acyclic graph (DAG) of
•  compound nodes as roots, which contain the prior probability of observing the compound
•  feature nodes as leaves, which contain the probability associated with the node given its set of parent compounds

Construct Compound Network (once)

•  Inverted DAG with a single leaf for the target molecule and multiple roots that correspond to the features that express the query.

•  A set of intermediate query nodes may also be used when multiple queries are used to express the target.

•  Attach it to the compound network

Construct Query Network for each query

–  Find probability that target molecule (A) is satisfied given compound cj has been observed

•  Instantiate each cj in turn, which corresponds to attaching evidence to the network by stating that cj is true and the rest of the compounds are false

–  Find subset of cj’s which maximizes the probability value of node A (best subset).

–  Retrieve these cj’s as the answer to query.

Similarity calculation
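The retrieval loop described on this slide (instantiate one compound at a time, compute the belief in the target, return the best-scoring compounds) can be sketched as below. The belief function here is only a placeholder (the fraction of query features present in the compound) and is not the network's actual probability estimate; all names and data are hypothetical.

```python
def toy_belief(compound_features: set, query_features: set) -> float:
    """Placeholder belief: fraction of query features found in the compound."""
    if not query_features:
        return 0.0
    return len(compound_features & query_features) / len(query_features)


def rank_compounds(compounds: dict, query_features: set, belief=toy_belief) -> list:
    """Score every compound as if it alone were observed, then sort by belief."""
    scores = {
        name: belief(features, query_features)
        for name, features in compounds.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical compounds described by sets of feature identifiers.
compounds = {"c1": {1, 2, 3}, "c2": {2, 3, 4, 5}, "c3": {7}}
ranking = rank_compounds(compounds, query_features={2, 3, 4})
top_hits = [name for name, _ in ranking[:2]]
```

Passing the belief function as a parameter keeps the loop independent of how the conditional probabilities are estimated.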

•  The retrieval of an active compound compared to a given target structure is obtained by means of an inference process through a network of dependences.
•  To achieve the inference process
   –  We need to estimate the strength of the relationships represented by the network
   –  This involves estimating and encoding a set of conditional probability distributions
•  The inference network we have described comprises four different layers of nodes (four different random variables); the first layer comprises the compound nodes (roots)

•  The probability associated with these nodes is defined as:

P(cj)=1/(collection size) ⇒ prior probability

•  The second layer comprises the feature nodes, so we need to compute P(fi).
•  P(fi|cj) will be computed as follows, since the dependency is based on the first layer (parent nodes).

•  A weighting function is used to estimate the probability P(fi|cj)

•  In this function, α is a constant (experiments using the inference network (Turtle, 1991) show that the best value for α is 0.4), ffij is the frequency of the ith feature within the jth compound, icfi is the inverse compound frequency of the ith feature in the collection, clj is the size of the jth compound, total_cl is the total length of compounds in the collection, and m is the total number of compounds in the collection. The equation has been adapted from the Okapi retrieval system (Robertson et al., 1995).
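The equation itself is not preserved in this transcript. As an illustration only, the sketch below uses the Okapi-style belief function familiar from the InQuery text retrieval system, rewritten with the symbols defined above. Treat the functional form, the constants 0.5 and 1.5, and the name `cf_i` (number of compounds containing feature i) as assumptions, not as the authors' exact formula.

```python
import math


def feature_belief(ff_ij: float, cf_i: int, cl_j: float,
                   total_cl: float, m: int, alpha: float = 0.4) -> float:
    """Assumed InQuery/Okapi-style estimate of P(fi | cj = true).

    ff_ij    frequency of feature i in compound j
    cf_i     number of compounds containing feature i (hypothetical name)
    cl_j     size of compound j
    total_cl total length of all compounds in the collection
    m        number of compounds in the collection
    alpha    default belief (0.4 in the slides)
    """
    avg_cl = total_cl / m
    # Saturating, length-normalized feature frequency in [0, 1).
    nff = ff_ij / (ff_ij + 0.5 + 1.5 * cl_j / avg_cl)
    # Normalized inverse compound frequency in (0, 1) for 1 <= cf_i <= m.
    nicf = math.log((m + 0.5) / cf_i) / math.log(m + 1.0)
    return alpha + (1.0 - alpha) * nff * nicf


# Hypothetical collection: 100 compounds, 2000 features in total.
rare = feature_belief(ff_ij=2, cf_i=2, cl_j=20, total_cl=2000, m=100)
common = feature_belief(ff_ij=2, cf_i=50, cl_j=20, total_cl=2000, m=100)
```

Under this form a feature absent from the compound receives the default belief α, and rarer features contribute more than common ones.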

•  The third layer comprises only the query nodes, P(qk)

•  Here, cjk is the set of features in common between the jth compound and the kth query, clj is the size of the jth compound, nffik is the normalized frequency of the ith feature within the kth query, nicfi is the normalized inverse compound frequency of the ith feature in the collection, pi is the estimated probability at the ith feature node, and ffij is the frequency of the ith feature within the jth compound.

•  The last layer comprises only the activity-need node (target), or bel(A) in the case where more than one query is used.

•  Two combination rules are used: weighted MAX and weighted SUM

•  Here, cjk is the set of features in common between the jth compound and the kth query, qlk is the size of the kth query, pjk is the estimated probability that the kth query is met by the jth compound, and r is the number of queries.
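The weighted MAX and weighted SUM equations are not preserved in this transcript. The sketch below shows one plausible reading, assuming each query k is weighted by the fraction of its features shared with the compound, |cjk| / qlk, and that the SUM rule averages over the r queries. Both choices are assumptions made for illustration.

```python
def query_weights(common_counts: list, query_sizes: list) -> list:
    """Assumed weight per query: |c_jk| / ql_k (shared features over query size)."""
    return [c / ql for c, ql in zip(common_counts, query_sizes)]


def weighted_max(p_jk: list, weights: list) -> float:
    """bel_max(A): the best weighted query belief for compound j."""
    return max(w * p for w, p in zip(weights, p_jk))


def weighted_sum(p_jk: list, weights: list) -> float:
    """bel_sum(A): the weighted query beliefs averaged over the r queries."""
    r = len(p_jk)
    return sum(w * p for w, p in zip(weights, p_jk)) / r


# Hypothetical compound j scored against r = 2 queries.
weights = query_weights(common_counts=[3, 1], query_sizes=[4, 5])
bel_max = weighted_max([0.8, 0.6], weights)
bel_sum = weighted_sum([0.8, 0.6], weights)
```

The MAX rule rewards a compound that matches any one reference structure well, while the SUM rule rewards compounds that match many of them moderately.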

•  Subset of MDDR with 40751 molecules
   –  12 activity classes
•  In all, 6804 actives in the 12 classes
   –  10 sets of 10 randomly chosen compounds from each activity class (to form a set of queries).
   –  For comparison purposes, similarity calculation is also done using the non-binary Tanimoto coefficient
   –  Six different types of weighted fingerprints from Scitegic
      •  atom type extended-connectivity counts (ECFC),
      •  functional class extended-connectivity counts (FCFC),
      •  atom type atom environment counts (EEFC),
      •  functional class atom environment counts (FEFC),
      •  atom type hashed atom environment counts (EHFC), and
      •  functional class hashed atom environment counts (FHFC)

Experimental details
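The non-binary Tanimoto coefficient used as the baseline on count fingerprints can be sketched as follows. This is a minimal version assuming each fingerprint is a dictionary from feature identifier to count; the feature names are invented.

```python
def tanimoto_counts(x: dict, y: dict) -> float:
    """Non-binary (continuous) Tanimoto coefficient for count fingerprints.

    T(x, y) = sum(x_i * y_i) / (sum(x_i^2) + sum(y_i^2) - sum(x_i * y_i))
    """
    dot = sum(count * y.get(feature, 0) for feature, count in x.items())
    norm_x = sum(count * count for count in x.values())
    norm_y = sum(count * count for count in y.values())
    denominator = norm_x + norm_y - dot
    return dot / denominator if denominator else 0.0


# Hypothetical count fingerprints (feature identifier -> occurrence count).
fp_query = {"feat_a": 2, "feat_b": 1}
fp_compound = {"feat_a": 1, "feat_c": 3}
similarity = tanimoto_counts(fp_query, fp_compound)
```

For binary vectors (all counts equal to 1) this expression reduces to the usual set-based Tanimoto coefficient.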

Code  Activity class                       Actives  Unique AF(a)  Unique MF(b)  Av. mols per AF  Av. mols per MF  Diversity mean  Diversity SD
5H3   5HT3 antagonists                     213      133           87            1.60             2.45             0.8537          0.008
5HA   5HT1A agonists                       116      67            54            1.73             2.15             0.8496          0.007
D2A   D2 antagonists                       143      109           75            1.31             1.91             0.8526          0.005
Ren   Renin inhibitors                     993      542           328           1.83             3.03             0.7188          0.002
Ang   Angiotensin II AT1 antagonists       1367     698           396           1.96             3.45             0.7762          0.002
Thr   Thrombin inhibitors                  885      528           335           1.68             2.64             0.8283          0.002
SPA   Substance P antagonists              264      119           78            2.22             3.38             0.8284          0.006
HIV   HIV-1 protease inhibitors            715      455           330           1.57             2.17             0.8048          0.004
Cyc   Cyclooxygenase inhibitors            162      83            44            1.95             3.68             0.8717          0.006
Kin   Tyrosine protein kinase inhibitors   453      247           162           1.83             2.80             0.8699          0.006
PAF   PAF antagonists                      716      381           252           1.88             2.84             0.8669          0.004
HMG   HMG-CoA reductase inhibitors         777      337           168           2.31             4.63             0.8230          0.002

(a) Unique AF is the number of unique atomic frameworks present in the class.
(b) Unique MF is the number of unique molecular frameworks present in the class.

MDDR Data

Use of a single reference structure

Highest diverse class

Comparison of the average percentage of unique atomic frameworks obtained in the top 5% of the ranked test set using BIN & Tan with EHFC_4

Highest diverse class

Use of multiple reference structures

Comparison between BIN & Tan using MAX rule and ECFC_4

Use of multiple reference structures

Comparison of the average percentage of atomic frameworks retrieved in the top 5% of the ranked test set using BIN-MAX & Tan-MAX with ECFC_4
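The comparisons above are based on what is retrieved in the top 5% of the ranked test set. A recall-style measure of that kind can be sketched as below, assuming a ranked list of compound identifiers and a set of known actives; the function name and the data are hypothetical, and integer arithmetic is used for the cutoff.

```python
def recall_at_top_percent(ranked_ids: list, active_ids: set, percent: int = 5) -> float:
    """Fraction of the known actives found in the top `percent`% of the ranking."""
    if not active_ids:
        return 0.0
    # At least one compound is always inspected.
    cutoff = max(1, len(ranked_ids) * percent // 100)
    top = set(ranked_ids[:cutoff])
    return len(top & active_ids) / len(active_ids)


# Hypothetical ranked test set of 40 compounds with 4 known actives.
ranked_ids = ["m%d" % i for i in range(40)]
actives = {"m0", "m1", "m5", "m30"}
recall = recall_at_top_percent(ranked_ids, actives)
```

Counting unique frameworks instead of actives would follow the same pattern, with each retrieved active first mapped to its framework identifier.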

•  So far we have considered using just a single molecular descriptor and multiple reference structures as the basis for a search

•  Further work
   –  to search with multiple molecular descriptors (ECFC4, EHFC4, FHFC4, FPFC4, PHPFC3) with single and multiple reference structures

BIN with multiple molecular descriptors

Use of a single molecular descriptor and a single reference structure

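[Figure: inference network for multiple descriptors. Layers: compound nodes (c1, c2, ..., cm); feature nodes (f1 ... fn, repeated for each descriptor); query nodes (q1 ... qr, repeated for each descriptor up to Ds); weighted-max link matrices (wmax1 ... wmaxs); target node.]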

Use of multiple molecular descriptors and a single reference structure

Comparison between multiple descriptors and single descriptor with single reference structure using BIN

Comparison between multiple descriptors and single descriptor with multiple reference structures using BIN

Use of multiple molecular descriptors and multiple reference structures

•  BIN with a single active reference structure outperforms the Tanimoto similarity method (Tan) in 11 classes (by between 6% and 71%)
   –  19% overall improvement
   –  In only one activity class (cyclooxygenase inhibitors) is BIN slightly inferior to Tan (-5%)

•  BIN with multiple reference structures significantly outperforms Tan in all activity classes (by between 5% and 118%)
   –  35% improvement in the overall average recall rate

Summary I

•  BIN with multiple descriptors and a single reference structure slightly outperforms BIN with a single descriptor and a single reference

•  BIN with multiple descriptors and multiple reference structures slightly outperforms BIN with a single descriptor and multiple references

•  BIN with multiple descriptors enhances performance by a high percentage when the sought actives are structurally heterogeneous.
   –  It enhances performance only slightly when the sought actives are structurally homogeneous.

Summary II

•  Some evidence to suggest that the BIN is more effective at scaffold hopping for the more diverse data sets.

•  The networks do not impose additional costs because the networks do not include cycles.

•  The major strength is the combination of distinct evidential sources to support the rank of a given compound.

•  BIN provides the ability to integrate several descriptors and several references into a single framework

Summary III

Thank you
