Farhad Hormozdiari
Lab for Computational Biology, Simon Fraser University
EXTENDED NEAREST NEIGHBOR CLASSIFICATION METHODS FOR PREDICTING
SMALL MOLECULE ACTIVITY
Outline
Small molecules
Similarity measures
Classification: kernel methods, nearest neighbor classifier, centroid-based nearest neighbor
Distance / metric learning
Results
What are small molecules ?
Chemical compounds with small molecular mass
Important in the synthesis and maintenance of larger molecules (DNA, RNA, and proteins)
High potential as medicines
Increasing number of databases: PubChem, ChemDB, ChemBank, ...
Standard task in in silico drug discovery: classifying a compound with unknown activity
Representation of small molecules
Chemical (conventional) descriptors: A(x) = (25, 0.24, 1, 12.3, ..., 5, 2.12, ...)
Chemical structures represented by labeled graphs
Classification methods for small molecules
Artificial Neural Networks (ANN)
Support Vector Machine (SVM)
k-Nearest Neighbor classification
Recent work has focused on kernel methods
SVM (Support Vector Machine)
Φ(x): a fixed feature transformation; labels t_n ∈ {1, -1}
Find a decision boundary y(x) = w^T Φ(x) + b
Goal: maximize the distance from the boundary to the closest points,

$$\mathrm{dist} = \frac{t_n\, y(x_n)}{\lVert w \rVert} = \frac{t_n \left( w^T \Phi(x_n) + b \right)}{\lVert w \rVert}$$

$$\arg\max_{w,\,b} \left\{ \frac{1}{\lVert w \rVert} \min_n \left[ t_n \left( w^T \Phi(x_n) + b \right) \right] \right\}$$

Solved by quadratic programming
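To make the training step concrete, here is a minimal scikit-learn sketch (not from the original slides); the descriptor matrix and activity labels are synthetic placeholders, and SVC solves the margin-maximization QP internally.

```python
# Minimal sketch: SVM classification of compounds from descriptor vectors.
# X (n_compounds x n_descriptors) and y (labels t_n in {1,-1}) are
# hypothetical stand-ins for a real QSAR descriptor matrix.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))               # e.g. 30 QSAR descriptors per compound
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)   # toy activity labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = SVC(kernel="linear")                   # maximizes the margin via QP internally
clf.fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
```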
Recent work on small molecule classification
Marginalized Kernel (MK): Tsuda et al. 2002, Kashima et al. (ICML 2003)
Features are counts of labeled paths generated by random walks
Improved Marginalized Kernel: Mahe et al. (ICML 2004)
Avoids totters (walks that return to the node visited two steps earlier)
Recent work on small molecule classification
Swamidass et al. (Bioinformatics 2005):
Kernels based on the 3D Euclidean coordinates of atoms
One histogram per pair of atom labels; similarity is computed between histograms
Cao et al. (ISMB 2008, Bioinformatics 2010):
Use the Maximum Common Substructure (MCS) as a measure of similarity
Randomly pick "basis" compounds; the features of a molecule are the MCSs between that molecule and all basis compounds
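As a rough illustration of the MCS-feature idea, a hedged RDKit sketch; the basis SMILES strings and the use of the MCS atom count as the feature value are illustrative assumptions, not the exact construction of Cao et al.

```python
# Sketch of MCS-based features: one feature per "basis" compound, namely the
# size of the maximum common substructure shared with it. Assumes RDKit; the
# SMILES strings are hypothetical examples.
from rdkit import Chem
from rdkit.Chem import rdFMCS

basis = [Chem.MolFromSmiles(s) for s in ["c1ccccc1O", "CC(=O)Nc1ccccc1"]]

def mcs_features(mol):
    # Number of atoms in the MCS between mol and each basis compound.
    return [rdFMCS.FindMCS([mol, b]).numAtoms for b in basis]

query = Chem.MolFromSmiles("Cc1ccccc1O")
print(mcs_features(query))
```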
Nearest Neighbor Classification
Nearest Neighbor (NN) Classification
The label of a molecule is predicted from the labels of its nearest neighbors
Asymptotically, the NN error is less than twice the Bayes error (Cover and Hart, 1967)
One of the most widely used classifiers in small molecule classification because of its simplicity
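A minimal k-NN sketch with scikit-learn; the descriptor vectors and labels are synthetic placeholders.

```python
# Minimal k-NN sketch: predict each label by majority vote among the k
# nearest training points. X and y are hypothetical descriptors/labels.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = (X[:, 0] > 0).astype(int)

knn = KNeighborsClassifier(n_neighbors=3, metric="manhattan")  # L1 distance
knn.fit(X, y)
print(knn.predict(X[:5]))   # majority vote among the 3 nearest neighbors
```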
Nearest Neighbor Classification Drawbacks
Speed/memory: distances to all training set points must be computed, and the entire training set is stored in memory
Overfitting
Centroid-Based Nearest Neighbor (CBNN) Classification
Centroids are picked from each class
The bioactivity of a small molecule is predicted based on its nearest centroids
CBNN tackles the NN drawbacks
Centroid Selection
Hart (1968) introduced Condensed NN classification (a sketch follows below):
Initially, the set of centroids S includes one point
Iteratively go through each remaining point p; if its nearest neighbor in S has the opposite class, p is added to S
Fast Condensed NN classification (Angiulli, ICML 2005):
S is initialized with the medoids of each class
For each point in S, its Voronoi cell is built; if a cell contains a point from a different class, that point is added to S
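A compact numpy sketch of Hart's condensed selection rule as described above; the data is synthetic.

```python
# Sketch of Hart's Condensed NN selection: keep a point as a centroid
# whenever the current centroid set would misclassify it.
import numpy as np

def condensed_nn(X, y):
    S = [0]                                   # start with one arbitrary point
    changed = True
    while changed:                            # repeat until no point is added
        changed = False
        for p in range(len(X)):
            if p in S:
                continue
            d = np.linalg.norm(X[S] - X[p], axis=1)
            nearest = S[int(np.argmin(d))]
            if y[nearest] != y[p]:            # opposite class -> add as centroid
                S.append(p)
                changed = True
    return S

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 2)); y = (X[:, 0] > 0).astype(int)
print("centroids:", len(condensed_nn(X, y)), "of", len(X))
```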
Centroid Selection
Gabriel Graph (Gabriel and Sokal, 1969): there is an edge between two points u, v if for every other point w, dist(u,v)² < dist(u,w)² + dist(w,v)²
After the graph is built, nodes connected to nodes of a different class are selected as centroids
(Figure: points u, v, w; the edge between u and v is removed because w violates the condition.)
Centroid Selection
Relative Neighborhood Graph (Toussaint, 1980): there is an edge between two points u, v if for every other point w, dist(u,v) < max{dist(u,w), dist(w,v)}
After the graph is built, nodes connected to nodes of a different class are selected as centroids (a sketch follows below)
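A brute-force numpy sketch of RNG-based centroid selection as described above (O(n³), for illustration only; the data is synthetic).

```python
# Sketch of centroid selection via the Relative Neighborhood Graph: connect
# u,v if no w is closer to both, then keep endpoints of cross-class edges.
import numpy as np

def rng_centroids(X, y):
    n = len(X)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)   # pairwise distances
    centroids = set()
    for u in range(n):
        for v in range(u + 1, n):
            others = [w for w in range(n) if w not in (u, v)]
            # RNG edge: d(u,v) < max(d(u,w), d(w,v)) for every other point w
            if all(D[u, v] < max(D[u, w], D[w, v]) for w in others):
                if y[u] != y[v]:              # edge crosses the class boundary
                    centroids.update((u, v))
    return sorted(centroids)

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 2)); y = (X[:, 0] > 0).astype(int)
print(rng_centroids(X, y))
```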
Combinatorial Centroid Selection Combinatorial Centroid Selection(CCS)
Given a training set of points (compounds) whose distances satisfy the triangle inequality
Find the minimum number of centroids (selected compounds) such that for each point, its nearest centroid is from the same class
For simplicity, we only deal with binary classification, i.e., C1 is the first class and C2 the second class
CCS Complexity
k-CCS problem: select a set of at most k points such that for each point, its nearest centroid is from the same class
k-CCS is NP-complete
k-Dominating Set (k-DS): given a graph G(V,E), ask whether there exists V' ⊆ V with |V'| ≤ k such that each node v ∈ V is either in V' or adjacent to a node in V'
k-DS ≤_p k-CCS
This reduction implies that no approximation better than O(log n) exists for CCS unless P = NP
Integer Linear Program Solution
Notation:
X(x) = 1 if x is a centroid, and 0 otherwise
C1_i's are the points in class C1; C2_j's are the points in class C2
d(X, Y): a metric distance between two points X and Y
Objective: minimize the number of chosen points or compounds (the centroids):

$$\min \ \sum_{C1_i \in C1} X(C1_i) \;+\; \sum_{C2_j \in C2} X(C2_j)$$
Integer Linear Program Solution
Ensure that for every pair of compounds i of class 1 and j of class 2, if j is chosen as a centroid, then a compound k of class 1 within radius d(C1_i, C2_j) of i must also be chosen as a centroid (and symmetrically for class 2):

$$\sum_{C1_k:\ d(C1_i,\, C1_k) \,<\, d(C1_i,\, C2_j)} X(C1_k) \ \geq\ X(C2_j) \qquad \forall\, C1_i \in C1,\ C2_j \in C2$$

$$\sum_{C2_l:\ d(C2_p,\, C2_l) \,<\, d(C2_p,\, C1_q)} X(C2_l) \ \geq\ X(C1_q) \qquad \forall\, C2_p \in C2,\ C1_q \in C1$$
Integer Linear Program Solution
Ensure that for each class at least one compound is chosen as a centroid, and that the indicator variables are binary:

$$\sum_{C1_i \in C1} X(C1_i) \geq 1, \qquad \sum_{C2_i \in C2} X(C2_i) \geq 1$$

$$X(x) \in \{0, 1\} \qquad \forall\, x \in C1 \cup C2$$
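A minimal PuLP sketch of this ILP, assuming a precomputed distance matrix D and class labels cls; the toy data is synthetic and the bundled CBC solver is used.

```python
# Sketch of the CCS ILP above: binary variable x[p] marks p as a centroid;
# the pairwise constraints force each point's nearest centroid to share its class.
import numpy as np
import pulp

rng = np.random.default_rng(4)
P = rng.normal(size=(20, 2))
cls = np.where(P[:, 0] > 0, 1, 2)
D = np.linalg.norm(P[:, None] - P[None, :], axis=2)
n = len(P)

prob = pulp.LpProblem("CCS", pulp.LpMinimize)
x = pulp.LpVariable.dicts("x", range(n), cat="Binary")
prob += pulp.lpSum(x[p] for p in range(n))        # minimize number of centroids

for i in range(n):
    for j in range(n):
        if cls[i] != cls[j]:
            # if j is a centroid, some same-class k closer to i must be one too
            closer = [k for k in range(n)
                      if cls[k] == cls[i] and D[i, k] < D[i, j]]
            prob += pulp.lpSum(x[k] for k in closer) >= x[j]

for c in (1, 2):                                  # at least one centroid per class
    prob += pulp.lpSum(x[p] for p in range(n) if cls[p] == c) >= 1

prob.solve(pulp.PULP_CBC_CMD(msg=0))
print("centroids:", [p for p in range(n) if x[p].value() == 1])
```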
Fixed Size Neighborhood Solution
The ILP solution suffers from:
Huge size, due to pairwise constraints among points
A potential trivial solution
We propose a relaxed version of the ILP that reduces the number of constraints:
For each point p, within the radius equal to the distance from p to its k-th nearest neighbor of a different class, there must be a centroid of the same class as p
We call this method CCNN1
Special case of CCS
When the majority of the compounds do not exhibit the bioactivity of interest:
All compounds that exhibit the bioactivity of interest are picked as centroids
We minimize the number of compounds chosen from those that do not exhibit the activity of interest
Special case of CCS
It can be reduced to Set Cover, which admits an O(log n)-approximation algorithm
Set Cover problem: given a universal set U and a collection C of subsets of U, pick the minimum number of sets from C that together cover all elements of U
NP-complete
Greedy algorithm: repeatedly pick the set that covers the maximum number of uncovered elements of the universal set (see the sketch below)
We call this method CCNN2
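A short sketch of the greedy heuristic; the universe and subsets are toy examples.

```python
# Greedy set cover as used conceptually by CCNN2: repeatedly pick the set
# covering the most still-uncovered elements (O(log n)-approximation).
def greedy_set_cover(universe, subsets):
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best = max(subsets, key=lambda s: len(uncovered & subsets[s]))
        if not uncovered & subsets[best]:
            raise ValueError("instance is not coverable")
        chosen.append(best)
        uncovered -= subsets[best]
    return chosen

U = {1, 2, 3, 4, 5}
C = {"a": {1, 2, 3}, "b": {2, 4}, "c": {4, 5}, "d": {5}}
print(greedy_set_cover(U, C))   # -> ['a', 'c']
```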
Experimental Results - Datasets
Mutagenicity dataset: aromatic and hetero-aromatic nitro compounds tested for mutagenicity on Salmonella
188 compounds with positive levels of log mutagenicity, 63 negative examples
Drug dataset: 958 drug compounds and 6550 non-drug compounds, including antibiotics; human, bacterial, plant, and fungal metabolites; and drug-like compounds
Experimental Results - Descriptors
The structures of the compounds were used to compute:
30 3D inductive QSAR descriptors (Cherkasov et al. 2005)
32 conventional QSAR descriptors computed by MOE: number of basic atoms, number of bonds, ...
Comparison with other CBNN-based methods (Drug dataset)

Method   #Centroids   % of training set   Accuracy (%)
RNG      1705         28.39               89.00
GG       4804         79.99               92.00
CCNN1    1489         24.79               89.89
CCNN2    1052         17.51               92.17
NN       6006         100                 91.02
Comparison with small molecule classification methods (Mutag dataset)

Method                Precision   Recall   Accuracy   Running time
NN                    87.80       92.00    86.17      1 min (?)
CCNN1                 92.00       92.74    89.94      6 min
CCNN2                 92.13       94.35    90.91      6 min
SVM-Linear            92.00       92.00    89.36      6 min
SVM-Poly              91.30       92.00    88.83      6 min
SVM-Radial            86.60       92.80    85.63      6 min
Cao et al.            88.2        77.8     82.35      20 min
MK (Kashima et al.)   94.4        88.7     89.10      6 min
Comparison with small molecule classification methods (Drug dataset)

Method                Precision   Recall   Accuracy   Running time
NN                    64.70       65.30    91.02      45 min (?)
CCNN1                 56.36       61.18    89.89      181 min
CCNN2                 69.12       69.70    92.17      150 min
SVM-Linear            76.10       8.70     87.89      121 min
SVM-Poly              77.10       38.30    90.17      180 min
SVM-Radial            80.10       35.00    90.60      121 min
Cao et al.            81.20       56.20    92.00      ~5 days
MK (Kashima et al.)   53.70       57.00    89.10      ~1 day
Learning the Metric Space
Emre Karakoc, Artem Cherkasov, S. Cenk Sahinalp (ISMB 2006)
Quantitative Structure-Activity Relationship (QSAR)
Similarity measure: Minkowski distance

$$L_p(X, Y) = \left( \sum_{i=1}^{n} \left| X[i] - Y[i] \right|^p \right)^{1/p}$$

Here each feature is equally significant, but some features should be weighted more and some less
Weighted Minkowski distance:

$$wL_1(X, Y) = \sum_{i=1}^{n} w_i \left| X[i] - Y[i] \right|$$
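The two distances above in numpy; the weight vector is a hypothetical learned w.

```python
# Minkowski and weighted-L1 distances between feature vectors.
import numpy as np

def minkowski(x, y, p=2):
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

def weighted_l1(x, y, w):
    return np.sum(w * np.abs(x - y))      # wL1: per-feature significance

x, y = np.array([1.0, 2.0, 3.0]), np.array([2.0, 0.0, 3.0])
print(minkowski(x, y, p=1), weighted_l1(x, y, w=np.array([1.0, 0.5, 0.0])))
```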
Main Idea
Can the weighted Minkowski distance be useful?
Reduce the number of features (cf. PCA)
Increase the accuracy
How do we learn the right W? Decrease the within-class distances and increase the between-class distances
Learn the optimal W
Given the training set T, let
Active set T^A = {T^A_1, ..., T^A_m}
Inactive set T^I = {T^I_1, ..., T^I_l}
Minimize f(T), where

$$f(T) = \frac{1}{m^2} \sum_{h=1}^{m} \sum_{j=1}^{m} \sum_{i=1}^{n} w_i \left| T^A_h[i] - T^A_j[i] \right| \;+\; \frac{1}{l^2} \sum_{h=1}^{l} \sum_{j=1}^{l} \sum_{i=1}^{n} w_i \left| T^I_h[i] - T^I_j[i] \right| \;-\; \frac{1}{ml} \sum_{h=1}^{m} \sum_{j=1}^{l} \sum_{i=1}^{n} w_i \left| T^A_h[i] - T^I_j[i] \right|$$
Learn the optimal W (cont.)
Min f(T)
s.t.

$$\frac{1}{m^2} \sum_{h=1}^{m} \sum_{j=1}^{m} \sum_{i=1}^{n} w_i \left| T^A_h[i] - T^A_j[i] \right| \ \leq\ \frac{1}{ml} \sum_{h=1}^{m} \sum_{j=1}^{l} \sum_{i=1}^{n} w_i \left| T^A_h[i] - T^I_j[i] \right|$$

$$w_i \in [0, 1] \quad \text{for all } i$$
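Since f(T) is linear in w, the weights can be found with an off-the-shelf LP solver. A hedged scipy sketch under the reconstruction above (the exact constraint set of the original formulation may differ); the compound matrices are synthetic.

```python
# Sketch of learning w as a linear program: collapse f(T) to per-feature
# coefficients (within-class minus between-class absolute differences,
# averaged) and minimize subject to 0 <= w_i <= 1.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(5)
TA = rng.normal(size=(15, 8))             # active compounds (m x n features)
TI = rng.normal(size=(20, 8)) + 0.5       # inactive compounds (l x n)
m, l = len(TA), len(TI)

def pair_sums(U, V):                      # sum over pairs of |U_h[i] - V_j[i]|
    return np.abs(U[:, None, :] - V[None, :, :]).sum(axis=(0, 1))

c = pair_sums(TA, TA) / m**2 + pair_sums(TI, TI) / l**2 \
    - pair_sums(TA, TI) / (m * l)         # linear objective coefficients

res = linprog(c, bounds=[(0, 1)] * TA.shape[1])
print("learned feature weights:", np.round(res.x, 2))
```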
Metric Learning
Weinberger et al. (NIPS 2006): a semidefinite program
D(x_i, x_j) = (x_i - x_j)^T M (x_i - x_j), where M = L^T L
s.t. M ⪰ 0
The gap between between-class and within-class distances is fixed in advance; the program computes the "best" M
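A small numpy sketch of the learned metric's form: with M = L^T L, the distance equals the squared Euclidean distance after applying L. The transform L here is a random stand-in for a learned one, not the output of Weinberger et al.'s optimization.

```python
# Mahalanobis-style metric D(xi, xj) = (xi - xj)^T M (xi - xj), with
# M = L^T L guaranteeing positive semidefiniteness.
import numpy as np

rng = np.random.default_rng(6)
L = rng.normal(size=(2, 4))               # stand-in for a learned transform
M = L.T @ L                               # PSD by construction

def metric(xi, xj):
    d = xi - xj
    return float(d @ M @ d)               # equals ||L xi - L xj||^2

xi, xj = rng.normal(size=4), rng.normal(size=4)
print(metric(xi, xj), np.sum((L @ xi - L @ xj) ** 2))  # identical values
```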
Classification of new compounds
Input: distances from a new compound Q to the compounds in the dataset
Assumption: the bioactivity level of Q is likely to be similar to that of its close neighbors
The kNN classifier estimates the bioactivity of Q as the majority bioactivity among its k nearest neighbors
Querying a compound
Naïve method: O(S), where S is the number of elements in the database
Binary search tree: the Vantage Point (VP) tree (Uhlmann 1991), a binary tree that recursively partitions the data space using the distances of data points to a randomly picked vantage point
VP-Tree
Internal nodes: (X_vp, M, Rptr, Lptr)
X_vp: the vantage point
M: the median of d(X_vp, X_i) over all X_i in the partitioned space
Leaves: references to data points
Proximity search in VP-tree
Given a query point q, a metric distance d(.,.), and a proximity radius r, the goal is to find all points x with d(x, q) < r
If d(q, X_vp) - r ≤ M, recursively search the inner partition
If d(q, X_vp) + r > M, recursively search the outer partition
If both conditions hold, search both partitions (see the sketch below)
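A compact pure-Python sketch of VP-tree construction and range search following the pruning rules above; the one-dimensional data and metric are toy examples.

```python
# VP-tree sketch: each internal node stores a vantage point and the median
# distance M; range search prunes partitions via the triangle inequality.
import random

def build(points, dist):
    if not points:
        return None
    if len(points) == 1:
        return {"leaf": points}
    vp = random.choice(points)
    rest = [p for p in points if p is not vp]
    ds = sorted(dist(vp, p) for p in rest)
    M = ds[len(ds) // 2]                       # median distance to vp
    inner = [p for p in rest if dist(vp, p) <= M]
    outer = [p for p in rest if dist(vp, p) > M]
    return {"vp": vp, "M": M, "in": build(inner, dist), "out": build(outer, dist)}

def search(node, q, r, dist, out):
    if node is None:
        return
    if "leaf" in node:
        out.extend(p for p in node["leaf"] if dist(q, p) < r)
        return
    d = dist(q, node["vp"])
    if d < r:
        out.append(node["vp"])
    if d - r <= node["M"]:                     # query ball may reach inner part
        search(node["in"], q, r, dist, out)
    if d + r > node["M"]:                      # query ball may reach outer part
        search(node["out"], q, r, dist, out)

dist = lambda a, b: abs(a - b)
tree = build([random.uniform(0, 100) for _ in range(50)], dist)
hits = []; search(tree, 50.0, 5.0, dist, hits)
print(sorted(hits))
```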
Can we do better?
Select multiple vantage points at each level: Space Covering VP (SCVP) trees (Sahinalp et al. 2003)
This increases the chance that the query falls inside one of the inner partitions
Can we do much better?
Instead of selecting random vantage points, select them more intelligently: the Deterministic Multiple Vantage Point (DMVP) tree
Select the minimum number of vantage points that cover the entire data collection (the OVPS problem)
Better space utilization (optimal redundancy)
The OVPS problem is NP-hard for any wL_p
Conclusion
NN is a powerful classifier
Small molecule classification as an NN problem: CBNN, via CCNN1 and CCNN2
Distance learning improves accuracy
DMVP trees speed up queries
Future work
Further investigation of possible approximation algorithms for selecting centroids
Combining CCNN (selecting centroids) with metric learning
Ideally, the problem formulation should ensure that the nearest neighbor of each point in the training set is in the same class as that point
Adapting CCNN to work with regression datasets
References
Phuong Dao*, Farhad Hormozdiari*, Hossein Jowhari, Kendall Byler, Artem Cherkasov, S. Cenk Sahinalp. Improved Small Molecule Activity Determination via Centroid Nearest Neighbors Classification. CSB 2008.
Emre Karakoc, Artem Cherkasov, S. Cenk Sahinalp. Distance Based Algorithms for Small Biomolecule Classification and Structural Similarity Search. ISMB 2006.
Iurii Sushko et al. Applicability Domains for Classification Problems: Benchmarking of Distance to Models for Ames Mutagenicity Set. Journal of Chemical Information and Modeling, 2010.
Acknowledgments
Cenk Sahinalp, Artem Cherkasov, Zehra Cataltepe, Emre Karakoc, Phuong Dao, Hossein Jowhari, Kendall Byler, and all members of the lab
Questions