Predictive Model for Blood-Brain Barrier

In drug discovery and development it is essential to determine whether a candidate molecule is capable of penetrating the Blood-Brain Barrier (BBB). The barrier is so named because it prevents most molecules in the blood from entering the brain; only those molecules that are highly concentrated with blood cells are allowed into the brain. The objective of our work is to find which molecules penetrate into the brain. The computational work was carried out in R on a dataset obtained from a forensic lab in Bengaluru. Five machine learning techniques were compared, namely Support Vector Machine (SVM), Neural Network, Random Forest, Decision Tree and Multiple Linear Regression. The experimental results reveal that SVM gives a better result than the other techniques for regression data, and that Decision Tree yields the lowest error rate for classification data.


    The blood-brain barrier (BBB) [1-10] is a highly selective permeability barrier that separates the
    circulating blood from the brain extracellular fluid (BECF) in the central nervous system (CNS). The
    blood-brain barrier is formed by brain endothelial cells, which are connected by tight junctions with an
    extremely high electrical resistivity of at least 0.1 Ω·m. The blood-brain barrier allows the passage of
    water, some gases, and lipid-soluble molecules by passive diffusion, as well as the selective transport of
    molecules such as glucose and amino acids that are crucial to neural function.

    Data mining has been used intensively and extensively by many organizations. In healthcare, data
    mining is becoming increasingly prevalent, if not increasingly necessary. Data mining applications can
    greatly benefit all parties involved in the healthcare industry. For example, data mining can help
    healthcare insurers detect fraud and abuse, healthcare organizations make customer relationship
    management decisions, physicians identify effective treatments and best practices, and patients receive
    improved and more affordable healthcare services. The huge amounts of data generated by healthcare
    transactions are too complex and voluminous to be processed and analyzed by traditional methods. Data
    mining provides the methodology and technology to transform these banks of data into useful
    information for decision making [11, 12, 21].

    Data mining on medical data [20] has great potential to improve the treatment quality of hospitals and
    increase the survival rate of patients. Medical data mining is one of the crucial means of obtaining
    valuable clinical knowledge from medical databases. Early prediction methods have become an evident
    need in many clinical areas. Clinical studies have found early detection and intervention to be vital for
    averting clinical deterioration in patients at general hospitals [13].

    The paper is organized as follows. Section 2 summarizes the work related to the BBB. Section 3 describes
    R and the methodologies adopted for the current work. Results are discussed in Section 4, followed by
    the conclusion and future work in Section 5.


    Scott Doniger et al. used 50 molecules, of which 25 were active and 25 inactive, divided randomly into
    a training dataset and a test dataset. Two different algorithms were implemented, namely Neural Network
    and Support Vector Machine, and 30 validation sets were drawn from these 50 molecules. The results show
    that the support vector machine outperforms the neural network: SVM predicted up to 96% of the
    molecules correctly, averaging 81.5%, whereas the neural network averaged 75.7% [4]. An Artificial
    Neural Network (ANN) model has been developed to predict the ratios of the steady-state concentrations
    of drugs in the brain to those in the blood (logBB) from their molecular structural parameters [9].

    Claudia Suenderhauf et al. took a dataset of 153 compounds, compiled using the more reliable in vivo
    BBB permeability-surface area (logPS) products, which are obtained by direct internal carotid artery
    perfusion. The open source Chemical Development Kit (CDK) was used to calculate physico-chemical
    properties and descriptors. The data were split into two classes, positively (CNSp+) and negatively
    (CNSp-) classified molecules, referring to compounds with logPS values 2 and 3, respectively. The
    decision tree induction (DTI) paradigm is an efficient and powerful method that can solve even linearly
    inseparable problems. Two widely used paradigms were used to induce the decision trees: the chi-squared
    automatic interaction detector (CHAID) and the classification and regression tree (CART), both built on
    CDK descriptors [3].

    Misha Denil et al. took a dataset of 179 random molecules and modeled it using the random forest
    algorithm. They compared the experimental values with the theoretical values and found that the
    experimental values gave better results than the theoretical values [18].


    A. R Programming

    R is a free, open-source programming language and software environment for statistical computing and
    graphics. The R language is widely used among statisticians and data miners for developing statistical
    software and for data analysis. Users can access R through a command-line interpreter. The capabilities
    of R are extended through user-created packages, which provide specialized statistical techniques,
    graphical devices, import/export capabilities, reporting tools, etc. [13-14].

    B. Machine Learning Techniques

    The following machine learning techniques were tried out using R.

    Decision Tree - A Decision Tree (DT) represents a set of rules that follows a hierarchy of classes and
    values and is used to classify instances. An instance is classified by first testing the attribute
    specified by the root and then following the branch corresponding to the value of that attribute in the
    instance. This process is then repeated for the sub-tree rooted at the new node [17, 18, 21]. The rpart
    package is loaded for Decision Tree in R [13-14].
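
    The root-to-leaf classification procedure described above can be sketched in a few lines. This is a
    language-neutral illustration in Python, not the rpart implementation used in the paper; the
    descriptor names are taken from Table 1, but the tree structure and thresholds are invented purely
    for illustration.

```python
# Hypothetical toy tree: the thresholds and layout are made up, not fitted
# to the paper's data. Each internal node tests one attribute; leaves carry
# a class label.
TOY_TREE = {
    "attribute": "TPSA(NO)",
    "threshold": 60.0,
    "left": {"label": "BBB+"},  # branch taken when value <= threshold
    "right": {
        "attribute": "nOHp",
        "threshold": 0.5,
        "left": {"label": "BBB+"},
        "right": {"label": "BBB-"},
    },
}


def classify(tree, instance):
    """Start at the root, test its attribute, follow the matching branch,
    and repeat on the sub-tree until a leaf is reached."""
    node = tree
    while "label" not in node:
        value = instance[node["attribute"]]
        node = node["left"] if value <= node["threshold"] else node["right"]
    return node["label"]


if __name__ == "__main__":
    print(classify(TOY_TREE, {"TPSA(NO)": 40.0, "nOHp": 0}))
    print(classify(TOY_TREE, {"TPSA(NO)": 90.0, "nOHp": 2}))
```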

    Decision Tree has the following advantages:

    - It can be applied to any type of data.
    - The final structure of the classifier is quite simple and can be stored and handled in a graceful
      manner.
    - It handles conditional information very efficiently, subdividing the space into sub-spaces that are
      handled individually.
    - It is normally robust and insensitive to misclassification in the training set.

    Random Forest - The Random Forest (RF) algorithm is based on decision trees, but instead of having
    only one tree it uses a group of decision trees. The algorithm grows many trees in order to improve
    predictive accuracy. It classifies a new case using each tree in the forest and selects the final
    predicted outcome by combining the results of all trees using a majority vote [18]. The randomForest
    package is loaded for Random Forest in R [13-14].
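
    The majority-vote step can be illustrated with a minimal sketch. It is written in Python rather than
    R and assumes each tree is simply a function that returns a class label; the three "trees" in the
    usage example are hypothetical stand-ins, not fitted models.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the label predicted by the largest number of trees.
    Ties are broken in favour of the label that appears first."""
    return Counter(tree_predictions).most_common(1)[0][0]


def forest_predict(trees, instance):
    """Classify one case with every tree, then combine by majority vote."""
    return majority_vote([tree(instance) for tree in trees])


if __name__ == "__main__":
    # Hypothetical stand-in trees, each looking at a different descriptor.
    trees = [
        lambda m: "BBB+" if m["TPSA(NO)"] <= 60 else "BBB-",
        lambda m: "BBB+" if m["nOHp"] == 0 else "BBB-",
        lambda m: "BBB+" if m["nO"] <= 2 else "BBB-",
    ]
    print(forest_predict(trees, {"TPSA(NO)": 45.0, "nOHp": 1, "nO": 1}))
```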

    Features of Random Forests include:

    - It is among the most accurate of current algorithms.
    - It runs efficiently on large databases.
    - It can handle thousands of input variables without variable deletion.
    - It gives estimates of which variables are important in the classification.

    Neural Network - Artificial Neural Networks (ANN) have been used extensively in the field of
    healthcare. A neural network is a non-linear statistical data modeling technique used for classification
    tasks. It uses interconnected artificial neurons to process information and is trained through an
    iterative process in which the weights between neurons are successively corrected. Neural networks are
    highly sensitive to the data, and they generally have a reduced ability to extrapolate beyond the range
    of the input variables [19-21]. The nnet package is loaded for Neural Network in R [13-14].
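
    To make the idea of interconnected, weighted neurons concrete, here is a minimal sketch of a forward
    pass through a network with one hidden layer, sized 13-3-1 like the network reported in the results.
    It is plain Python, not the nnet package; the logistic activation and the all-zero weights in the
    usage example are assumptions made for illustration, and training (the iterative weight correction)
    is not shown.

```python
import math


def sigmoid(z):
    """Logistic activation, assumed here for both layers."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, hidden_weights, hidden_biases, output_weights, output_bias):
    """One forward pass: inputs -> hidden units -> single output in (0, 1)."""
    hidden = [
        sigmoid(b + sum(w * xi for w, xi in zip(ws, x)))
        for ws, b in zip(hidden_weights, hidden_biases)
    ]
    return sigmoid(output_bias + sum(w * h for w, h in zip(output_weights, hidden)))


if __name__ == "__main__":
    n_inputs, n_hidden = 13, 3
    x = [0.0] * n_inputs  # placeholder descriptor values
    hidden_weights = [[0.0] * n_inputs for _ in range(n_hidden)]
    hidden_biases = [0.0] * n_hidden
    output_weights = [0.0] * n_hidden
    # With every weight at zero the output sits at the sigmoid midpoint.
    print(forward(x, hidden_weights, hidden_biases, output_weights, 0.0))
```

    Training would adjust hidden_weights, hidden_biases, output_weights and output_bias step by step so
    that the output moves towards the observed class.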

    Support Vector Machine - A support vector machine (SVM) searches for support vectors, which are
    observations found to lie at the edge of a region in space that forms a boundary between the classes
    of observations. SVM can also be used to classify data that are not linearly separable [21]. The e1071
    package is loaded for Support Vector Machine in R [13-14].


    A. Dataset

    The initial dataset contained 1665 molecular descriptors, which is too many to compute with and
    interpret directly. Hence the Weka (Waikato Environment for Knowledge Analysis) tool was used to select
    the significant features, using the CfsSubsetEval module followed by F-stepping (leave one out). With
    the CfsSubsetEval module, 77 molecular descriptors were selected based on the experimental logBB values.
    These 77 molecular descriptors were then further reduced to 13 descriptors. These 13 molecular
    descriptors are highly correlated with the logBB property and are used to predict which molecules enter
    the brain. 135 compounds such as benzene, cyclopropane, aminopyrine, isoflurane, methane, propranolol,
    hydroxyzine, nitrous oxide, etc. were selected to form the dataset, which was then read into R. Table 1
    describes the 13 molecular descriptors.
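
    CfsSubsetEval is Weka's correlation-based feature subset evaluator. As a rough illustration of the
    idea behind it, the sketch below computes the standard correlation-based feature selection merit of
    a subset from its size, its mean feature-class correlation and its mean feature-feature correlation.
    It is written in Python, it is not Weka's code, and the correlation values in the usage example are
    invented; how Weka estimates the correlations and searches the subsets is not shown.

```python
import math


def cfs_merit(k, mean_feature_class_corr, mean_feature_feature_corr):
    """Merit of a k-feature subset: high when features correlate with the
    target and low when they are redundant with one another."""
    if k < 1:
        raise ValueError("subset must contain at least one feature")
    denominator = math.sqrt(k + k * (k - 1) * mean_feature_feature_corr)
    return k * mean_feature_class_corr / denominator


if __name__ == "__main__":
    # Invented correlations: same relevance, different redundancy.
    print(cfs_merit(13, 0.4, 0.1))
    print(cfs_merit(13, 0.4, 0.7))
```

    With equal relevance, the less redundant subset receives the higher merit, which is why such a filter
    can shrink a large descriptor set to a small one.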

    The logBB value, i.e. the blood-brain distribution concentration, is computed for each compound.
    Compounds whose experimental logBB >= 0 are labeled BBB+ and those with logBB < 0 are labeled BBB-.
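
    The labeling rule can be written as a one-line function. This is a Python sketch that assumes
    compounds with logBB below 0 form the complementary BBB- class; the logBB values in the usage example
    are made up and are not measurements from the dataset.

```python
def bbb_label(log_bb):
    """Map an experimental logBB value to a class label (threshold at 0)."""
    return "BBB+" if log_bb >= 0 else "BBB-"


if __name__ == "__main__":
    # Hypothetical logBB values, for illustration only.
    for value in (0.52, 0.0, -0.30):
        print(value, bbb_label(value))
```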

    The dataset is used for both the regression model and the classification model: regression analysis
    deals with the continuous logBB values, and classification analysis deals with the discrete class
    labels.

    The performance measure used for the regression model [9-10] is R-squared, for the techniques Decision
    Tree, Random Forest, Neural Network, Support Vector Machine and Multiple Linear Regression.

    R-squared is taken here as the square of the correlation between the model's predicted values and the
    actual values, which is where the term "R-squared" comes from. This correlation can range from -1 to 1,
    so its square ranges from 0 to 1. The greater the magnitude of the correlation between the predicted
    values and the actual values, the greater the R-squared, regardless of whether the correlation is
    positive or negative.
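
    The definition above translates directly into code. The following Python sketch computes the Pearson
    correlation between predicted and observed values and squares it; the numbers in the usage example are
    invented and do not come from the paper's models.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences.
    Undefined (division by zero) if either sequence is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def r_squared(predicted, observed):
    """R-squared as the square of the predicted-observed correlation."""
    return pearson(predicted, observed) ** 2


if __name__ == "__main__":
    predicted = [0.1, -0.4, 0.8, 0.3]   # invented predictions
    observed = [0.2, -0.5, 0.6, 0.5]    # invented logBB values
    print(r_squared(predicted, observed))
```

    Because the correlation is squared, a perfectly inverted prediction also scores 1, which is the
    "regardless of sign" property noted above.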

    Figures 1 to 5 show the R-squared fit of the regression model for Decision Tree, Random Forest, Neural
    Network, Support Vector Machine and Multiple Linear Regression respectively. In Figures 1-5 the blue
    line indicates the best fit through the true points, and the black line along the diagonal represents
    perfect correlation. The R-squared values for the regression data using Decision Tree, Random Forest,
    Neural Network, Support Vector Machine and Multiple Linear Regression are 0.4591, 0.7388, 0.7723,
    0.8845 and 0.7676 respectively. Among the 5 techniques, SVM gave the best R-squared value for the
    regression model.

    The performance measure used for the classification model is the overall error rate, for the techniques
    Decision Tree, Random Forest, Neural Network and Support Vector Machine. Figures 6-9 show the
    classification models for Decision Tree, Random Forest, Neural Network and Support Vector Machine
    respectively.

    The Decision Tree constructed for the classification data is shown in Figure 6. It shows that TPSA(NO),
    R1e+ and Mor04m are the significant molecular descriptors.

    The left sub-plot of Figure 7 shows the conditional variable importance, calculated by randomly
    shuffling the values of a given predictor. The difference in model accuracy before and after the random
    permutation, averaged over all trees in the forest, tells us how important that predictor is for
    determining the outcome. For the right sub-plot of Figure 7, experiments were conducted using 100 trees
    with 2 variables tried at each split. The measure of importance there is the total decrease in a
    decision tree node's impurity (the splitting criterion) when splitting on a variable. The splitting
    criterion used is the Gini index; it is measured for a variable over all trees, giving the mean decrease
    in the Gini index of diversity relating to that variable. Based on this experiment, the left sub-plot
    of Figure 7 indicates that TPSA(NO), R1e+ and nOHp are the top 3 significant molecular descriptors,
    whereas the right sub-plot of Figure 7 indicates that TPSA(NO), Mor04m and MATS5m are the top 3
    significant molecular descriptors.

    A neural network with a 13-3-1 architecture is shown in Figure 8. Figure 9 shows the outcome of the
    Support Vector Machine for the classification data, where circles represent BBB+ training compounds and
    dark circles BBB+ test compounds, while triangles represent BBB- training compounds and dark triangles
    BBB- test compounds. The BBB+ molecules are those that penetrate into the brain.
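
    The two importance measures discussed for the random forest can be sketched for a single model. The
    Python code below is not the randomForest implementation: it computes the Gini impurity decrease of
    one split and the accuracy drop after shuffling one predictor, whereas a forest would average these
    over all of its trees. The model, rows and labels in the usage example are hypothetical.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted


def accuracy(model, rows, labels):
    """Fraction of rows for which the model returns the true label."""
    return sum(model(row) == label for row, label in zip(rows, labels)) / len(rows)


def permutation_importance(model, rows, labels, column, seed=0):
    """Accuracy before shuffling one predictor column minus accuracy after."""
    rng = random.Random(seed)
    shuffled = [row[column] for row in rows]
    rng.shuffle(shuffled)
    permuted = [{**row, column: value} for row, value in zip(rows, shuffled)]
    return accuracy(model, rows, labels) - accuracy(model, permuted, labels)


if __name__ == "__main__":
    # Hypothetical data: the label depends on "x" only; "noise" is irrelevant.
    rows = [{"x": i, "noise": i % 2} for i in range(10)]
    labels = ["BBB+" if i >= 5 else "BBB-" for i in range(10)]
    model = lambda row: "BBB+" if row["x"] >= 5 else "BBB-"
    print(gini_decrease(labels, labels[:5], labels[5:]))
    print(permutation_importance(model, rows, labels, "x"))
    print(permutation_importance(model, rows, labels, "noise"))
```

    A predictor the model never uses has zero permutation importance, while shuffling a predictor the
    model relies on can only lower the accuracy of an otherwise perfect model.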

    Figure 10 compares the techniques for the regression model, namely Decision Tree, Random Forest, Neural
    Network, Support Vector Machine and Multiple Linear Regression. SVM provides the best fit for the
    regression data. Figure 11 compares the classifiers for the classification model, namely Decision Tree,
    Random Forest, Neural Network and Support Vector Machine. Decision Tree yields the lowest error rate
    for the classification data.
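
    The overall error rate used to compare the classifiers is the fraction of compounds assigned to the
    wrong class. A minimal Python sketch follows; the predicted and actual labels are invented and are not
    results from the paper.

```python
def overall_error_rate(predicted, actual):
    """Fraction of cases whose predicted class differs from the true class."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    wrong = sum(p != a for p, a in zip(predicted, actual))
    return wrong / len(actual)


if __name__ == "__main__":
    predicted = ["BBB+", "BBB-", "BBB+", "BBB-"]  # invented predictions
    actual = ["BBB+", "BBB+", "BBB+", "BBB-"]     # invented true labels
    print(overall_error_rate(predicted, actual))
```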

    Fig 1. Predicted vs. observed logBB value using Decision Tree model for regression data
    Fig 2. Predicted vs. observed logBB value using Random Forest model for regression data
    Fig 3. Predicted vs. observed logBB value using Neural Network model for regression data
    Fig 4. Predicted vs. observed logBB value using Support Vector Machine model for regression data
    Fig 5. Predicted vs. observed logBB value using Multiple Linear Regression model for regression data
    Fig 6. Decision Tree for classification data
    Fig 7. Random Forest model for classification data
    Fig 8. Neural Network model for classification data
    Fig 9. Outcome of Support Vector Machine model for classification data
    Fig 10. Comparison of classifiers for regression data
    Fig 11. Comparison of classifiers for classification data

    Table 1. Description of 13 molecular descriptors

    No.  Name      Description
    1    nO        Number of oxygen atoms
    2    BIC1      Bond Information Content index (neighborhood symmetry of 1-order)
    3    MATS5m    Moran autocorrelation of lag 5 weighted by mass
    4    MATS5v    Moran autocorrelation of lag 5 weighted by van der Waals volume
    5    Mor04m    Signal 04 / weighted by mass
    6    R1e+      R maximal autocorrelation of lag 1 / weighted by Sanderson electronegativity
    7    nArNR2    Number of tertiary amines (aromatic)
    8    nOHp      Number of primary alcohols
    9    C-028     R--CR--X
    10   C-034     R--CR..X
    11   H-051     H attached to alpha-C
    12   O-057     Phenol / enol / carboxyl OH
    13   TPSA(NO)  Topological polar surface area using N,O polar contributions


    The blood-brain barrier does not allow all molecules into the brain; only those molecules that are
    highly concentrated with blood cells are allowed in. Earlier studies used manually selected descriptors
    for prediction. The objective of our work is to find which molecules penetrate into the brain. The Weka
    tool was used to find the 13 significant descriptors out of 1665 descriptors. These descriptors are
    highly correlated with the logBB property. Experiments were conducted using Decision Tree, AdaBoost,
    Random Forest, SVM and Neural Networks on regression data and classification data. This work is of
    utmost importance to pharmacy departments for finding which compounds penetrate into the brain based on
    the 13 significant descriptors.

    The experiments were conducted with 137 compounds; in future we would like to work with 150 more
    compounds beyond the 137 used for the current work. In addition, as part of further work, the authors
    would like to explore many more data mining techniques such as KNN, Naive Bayes and Bayesian
    classifiers, and ensemble learning methods such as stacking, voting, grading, bagging and many more.
    Furthermore, the authors would like to adopt various bio-inspired optimization techniques for
    significant feature selection, which would not only improve the performance of the classifiers but also
    reduce the computation time.


