Chapter 6

Machine Learning Classifiers for Protein-Protein Interaction Prediction

6.1 Introduction

In the previous chapter, I reported that the PPIs predicted by GN, GC, ES, PP and

GM methods show marginal overlap and hence they can complement each other.

Therefore, integration of these methods is likely to provide better information than any single method alone. The integrative approach requires tools and methods capable of transforming the heterogeneous scores generated by these prediction methods into a meaningful combination. Built on a rich mathematical/statistical foundation, Machine Learning Classifiers (MLCs) allow us to go beyond predictions based on a mere linear score cutoff and reveal hidden trends in the data in the form of testable models built using gold standard data. Then,

one can use these models to predict whether unknown protein pairs belong to

positive or negative category. A number of studies have shown the feasibility of the PPI prediction task using MLCs [54, 80, 88, 90, 91, 167]. It was suggested that the performance of MLCs can be improved by appropriate integration and selection of features rather than by adding as many features as possible [80-82, 143, 168, 169]. However, previous reports lack a critical comparative assessment of various MLCs and of their data dependency. Furthermore, the combination of MLCs for classification problems has been widely used in non-biological sciences, but few studies have evaluated multiple classifiers for PPI predictions [86, 88, 91, 170]. A majority of these studies have been performed in Yeast as the model organism [170-172]. Some of the issues worth exploring for E. coli encoded proteins are: 1) which MLC is best suited for PPI predictions in E. coli, 2) the feasibility of multiple MLC based PPI predictions, and 3) the choice of an appropriate Gold

Standard (GS) dataset for learning.

In this chapter, I investigate the aforementioned issues by predicting genome-wide PPI networks using seven different MLCs trained on four GS datasets. The resulting PPI networks provide a highly accurate protein-protein functional interaction map, which can be used for system-level analysis of the Escherichia coli proteome.

6.2 Material and Methods

6.2.1 Gold standard datasets

All MLCs applied to the PPI prediction task in this work are supervised learning

approaches and therefore require GS datasets for training and testing. The analysis was carried out using positive GS datasets composed of the Operon, DIP and Complex datasets, because the PPI prediction methods were able to strongly discriminate these datasets from the negative examples (Chapter 4). Considering the highest classification accuracy of the Operon and Complex datasets, it was decided to combine them to understand the

discriminative power of MLCs on mixture of two different types of PPIs i.e.

functional (co-operonic PPIs) and physical (co-complex PPIs). The negative

examples were randomly chosen from the high confidence negative dataset

generated for testing PPI prediction methods (Chapter 2, section 2.2.2). Each

positive dataset was combined with a negative subset five times the size of the positive set. This 1:5 ratio was used for predicting physiologically meaningful PPIs, whereas the performance of the MLCs was evaluated using positive and negative datasets of equal size to obtain a fair judgment of prediction accuracy.
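The construction of the two kinds of gold standard sets described above can be sketched as follows. This is a minimal illustration, not the procedure actually used in the thesis: the function name `build_gold_standard`, the protein identifiers and the fixed random seed are all assumptions made for the example.

```python
import random

def build_gold_standard(positives, negative_pool, neg_ratio, seed=0):
    """Combine all positives with neg_ratio times as many random negatives."""
    n_neg = neg_ratio * len(positives)
    if n_neg > len(negative_pool):
        raise ValueError("negative pool is too small for the requested ratio")
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    negatives = rng.sample(negative_pool, n_neg)
    # Label 1 = interacting pair, label 0 = non-interacting pair.
    return [(pair, 1) for pair in positives] + [(pair, 0) for pair in negatives]

# Toy protein pairs with hypothetical identifiers (not thesis data).
positives = [("P1", "P2"), ("P3", "P4")]
negative_pool = [("N%d" % i, "M%d" % i) for i in range(20)]

# 1:5 ratio for building prediction models, 1:1 ratio for performance evaluation.
training_set = build_gold_standard(positives, negative_pool, neg_ratio=5)
evaluation_set = build_gold_standard(positives, negative_pool, neg_ratio=1)
```

With two positive pairs, the first call yields 12 labelled examples (2 positives, 10 negatives) and the second yields 4 (2 positives, 2 negatives).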

6.2.2 Data feature encoding

The interaction scores (i.e. features) for all possible pairs of E. coli proteins

(including gold standard protein pairs) were calculated by GN, GC, ES, PP and GM

methods. There are two possible ways for encoding these scores for machine

learning. In the first, called “Summary”, the scores generated for a particular protein pair by the five PPI prediction methods are represented by a single value [88]. In the second, called “Detailed”, the same information is described by five separate values [88]. For the present analysis, the “Detailed” method of feature encoding was used because only a small number of features was available.

6.2.3 Machine learning classifiers

A set of seven MLCs was used in this study: Support Vector Machine (SVM), Decision Trees (DT), Random Forest (RF), Naïve Bayes (NB), Bayesian Network (BN), Neural Network (NN) and Logistic Regression (LR).

LibSVM package (http://www.cs.waikato.ac.nz/ml/weka/) was used for SVM

based predictions and WEKA toolbox (http://www.cs.waikato.ac.nz/ml/weka/)

for other machine learning algorithms. The cost and gamma parameters of the SVM were optimized using the grid.py script provided in the LibSVM package. The J48 variant of the decision tree was used because it handles numerical attributes and allows post-pruning after induction of the trees. The RF classifier is based on multiple DTs. A total of 100 DTs were grown, where each node uses a random subset of the features. To classify a new instance, its input vector is analyzed by each of the trees in the forest, and the output decision is based on a majority vote over all the trees. For the Naïve Bayes classifier, the kernel estimation parameter was switched on. All other classifiers were applied with the default parameters in WEKA.
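The majority-vote step of the RF classifier described above can be illustrated with a small sketch. The three "trees" below are hypothetical single-threshold rules written only for this example; they are not trees induced by WEKA, and the scores are invented.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the class label predicted by the largest number of trees."""
    counts = Counter(tree_predictions)
    label, _ = counts.most_common(1)[0]
    return label

def forest_predict(trees, feature_vector):
    """Let every tree classify the same input vector, then take the majority."""
    votes = [tree(feature_vector) for tree in trees]
    return majority_vote(votes)

# Hypothetical stand-ins for decision trees: each thresholds one of the
# five scores (GN, GC, ES, PP, GM) and returns 1 (interacting) or 0.
trees = [
    lambda x: 1 if x[0] > 0.5 else 0,
    lambda x: 1 if x[1] > 0.5 else 0,
    lambda x: 1 if x[4] > 0.5 else 0,
]
scores = [0.9, 0.2, 0.1, 0.0, 0.7]  # invented scores for one protein pair
prediction = forest_predict(trees, scores)
```

Here two of the three rules vote for class 1, so the pair is labelled as interacting. With an even number of trees a tie is possible; in this sketch a tie resolves to the label that was seen first, which is a simplification.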

6.2.4 Cross-validation and performance assessment of MLCs

The positive and negative GS datasets were randomly divided into five equal

sized datasets. The protein pairs present in the positive and negative datasets are referred to as true positives (i.e. interacting) and true negatives (i.e. non-interacting), respectively. We performed five-fold cross-validation on all the MLCs using the above-mentioned gold standard datasets. In each round of cross-validation, four of the five datasets were used for training the MLC and the remaining dataset was used for testing the model. This process was repeated five times using different combinations of “training” and “testing” datasets.

For each round, we calculated the sensitivity (or TPR), specificity, and positive

predictive value (or PPV or precision) as measures of the quality of the binary (two-class) classifications made by each MLC. The equations for sensitivity and PPV have been described in Chapter 2. Specificity was calculated as:

$$\mathrm{Specificity} = \frac{TN}{TN + FP}$$

For all performance measures, TP is the number of positive dataset protein pairs

that are correctly predicted as interacting by a given MLC. FN is the number of

positive dataset protein pairs that are incorrectly predicted as not interacting.

Similarly TN is the number of negative dataset protein pairs that are correctly

predicted as not interacting. FP is the number of negative dataset protein pairs

that are incorrectly predicted as interacting. The average performance of five

rounds of cross-validation was taken as a measure of prediction performance.

The TPR and FPR were calculated using the decision values for protein pairs

generated by SVM. For other MLCs, TPR and FPR values generated by WEKA

were obtained. The TPR and FPR values were used to plot ROC curves.
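The five-fold cross-validation and the three performance measures described in this section can be sketched as below. This is an illustrative outline under stated assumptions: the function names, the toy scores and the fixed-threshold `train_threshold` stand-in are hypothetical and do not reproduce the WEKA or LibSVM runs.

```python
import random

def split_into_folds(items, k=5, seed=0):
    """Shuffle the items and deal them into k folds of near-equal size."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]

def performance(y_true, y_pred):
    """Sensitivity (TPR), specificity and PPV for labels 1/0."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
    }

def cross_validate(positives, negatives, train, k=5):
    """Average the measures over k rounds; train() returns a predict function."""
    pos_folds = split_into_folds(positives, k)
    neg_folds = split_into_folds(negatives, k)
    rounds = []
    for i in range(k):
        # Fold i is held out for testing, the other k-1 folds are used for training.
        test = [(x, 1) for x in pos_folds[i]] + [(x, 0) for x in neg_folds[i]]
        training = [(x, 1) for j in range(k) if j != i for x in pos_folds[j]]
        training += [(x, 0) for j in range(k) if j != i for x in neg_folds[j]]
        predict = train(training)
        rounds.append(performance([y for _, y in test],
                                  [predict(x) for x, _ in test]))
    return {name: sum(r[name] for r in rounds) / k for name in rounds[0]}

def train_threshold(training):
    # Hypothetical stand-in for a real classifier: it ignores the training
    # data and always applies a fixed score cutoff of 0.5.
    return lambda score: 1 if score > 0.5 else 0

# Toy one-dimensional scores (invented), perfectly separable at 0.5.
positives = [0.60 + 0.03 * i for i in range(10)]
negatives = [0.10 + 0.03 * i for i in range(10)]
averages = cross_validate(positives, negatives, train_threshold)
example = performance([1, 1, 1, 1, 0, 0, 0, 0], [1, 1, 1, 0, 0, 0, 0, 1])
```

Positives and negatives are split separately so that every fold contains both classes, which mirrors the statement that both GS datasets were divided into five equal-sized parts. The sketch does not guard against a fold in which no pair is predicted positive, where PPV would be undefined.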

6.2.5 Genome-wide PPI prediction

The models generated by each of the seven MLCs using aforementioned four gold

standard datasets were used for genome-wide PPI predictions in E. coli. The

resulting 28 PPI networks were used for further analysis. The topological

properties of the networks were calculated as described in the chapter 5, section

5.2.2.

6.2.6 Comparison of predicted PPI networks with experimental and functional PPI datasets

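The similarity between a predicted PPI network (PPINet) and a gold standard dataset (GS) was measured as the Jaccard coefficient (JC), scaled between 0 and 100:

$$JC(PPINet, GS) = 100 \times \frac{|PPINet \cap GS|}{|PPINet \cup GS|}$$

The gold standards representing experimentally characterized and functional PPIs are the same as described in Table 4.1.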
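The Jaccard coefficient used for this comparison can be sketched in a few lines. The function name and the protein identifiers are hypothetical, and treating each interaction as an unordered pair is an assumption of this example rather than a detail stated in the text.

```python
def jaccard_coefficient(predicted_pairs, gold_pairs):
    """Jaccard coefficient between two sets of protein pairs, scaled to 0-100."""
    # Assumption: interactions are undirected, so (A, B) equals (B, A).
    predicted = {frozenset(pair) for pair in predicted_pairs}
    gold = {frozenset(pair) for pair in gold_pairs}
    union = predicted | gold
    if not union:
        return 0.0  # two empty networks: define the coefficient as zero
    return 100.0 * len(predicted & gold) / len(union)

# Toy networks with hypothetical protein identifiers.
predicted = [("p1", "p2"), ("p2", "p3"), ("p3", "p4")]
gold = [("p2", "p1"), ("p4", "p5")]
jc = jaccard_coefficient(predicted, gold)
```

In this toy case one interaction is shared and four distinct interactions exist in total, so the coefficient is 25 on the 0-100 scale.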

6.3 Results and Discussion

The 28 whole genome PPI maps of E. coli were reconstructed using models

generated by seven MLCs on five data features that are interaction scores

generated by GN, GC, ES, PP and GM. In Chapter 4, it was found that these prediction methods were able to discriminate Operon, Complex and DIP PPIs from

negatives with very high accuracy (Figure 4.1 & 4.2). Hence, these three datasets

were used as gold standard (GS) positive datasets to generate MLC models along

with a fourth GS called Complex_Operon, which is the union of the Operon and Complex PPIs. The Operon and Complex datasets were observed to perform relatively better than the other datasets. It was assumed that their combination would provide enough training examples of functional (Operon) and physical (Complex) PPIs to generate a better MLC model for predictions.

6.3.1 MLCs have predicted PPI networks with very high accuracy

A combination of five data features was used to train and test seven MLCs using

four GS datasets. Performance measures were averaged over five-fold cross-validation for each MLC. As shown in Figure 6.1, the ROC curves for MLCs trained on the four GS datasets show very high sensitivities (TPRs) at an FPR below 0.05 (i.e. 5%), meaning that at most 5% of the negative examples are falsely predicted as interacting. At 5% FPR, the MLCs were able to discriminate 65, 80, 95 and 85 percent of the PPIs (true positives) of DIP, Complex, Operon and Complex_Operon, respectively, from the negative training examples (Figure 6.1). A previous study on PPI predictions in Yeast reported that RF consistently ranked as one of the top two classifiers [88]. Figure 6.1 shows that RF indeed performs better than the other six classifiers on three of the four GSs evaluated in this study. NB is the best performer when the DIP GS was used for training the MLCs.

Figure 6.1 Performance accuracy measured as ROC curves for seven machine learning classifiers trained on four gold standard datasets. Each solid line depicts one machine learning classifier (MLC). The colors of the lines correspond to the Bayesian Network (BN), Decision Trees (DT), Logistic Regression (LR), Naïve Bayes (NB), Neural Network (NN), Random Forest (RF), and Support Vector Machine (SVM) classifiers. MLCs were trained on gold standard datasets from A) the DIP database B) co-complex PPIs C) co-operonic PPIs D) the union of (B) and (C) PPIs. All MLCs perform well on the four gold standard datasets. The RF classifier performed best on the (B), (C) and (D) gold standard datasets, whereas NB performed best on DIP. Note that the y-axis starts at 0.4 TPR while the x-axis ends at 0.05 FPR.

Another independent study on SVM based PPI predictions in E. coli was performed by Yellaboina and coworkers [54]. The authors reported an average fivefold cross-validation accuracy (i.e. the average of sensitivity and specificity) of 0.89 for an SVM trained on GN, GC and PP features using the Complex dataset. The present study was performed on five data features, which include the scores generated by ES and GM in addition to the above-mentioned three features. However, the accuracy obtained in the present study is very similar to that reported by these authors. It was expected that the two additional features would improve classifier accuracy over the three used by them, but they did not. One possible explanation for the similar results could be the composition of negatives and positives in their GS dataset. They used 13 times more negative than positive examples for predictions, which raised the specificity of their cross-validation analysis to one, compared with the 0.97 achieved in this study. The sensitivity of 0.79 achieved by them was slightly exceeded by the 0.81 achieved in this study [54].

To a large extent, ROCs are insensitive to the absolute number of false positives, since the calculations are based solely on the ranks of positives relative to those of negatives [167]. ROCs alone can therefore be misleading when the absolute number of false positives becomes relevant. The number of all possible protein pairs among the proteins of any organism is very high compared with the estimated number of PPIs. Out of about 8 million possible pairs of E. coli proteins, only about 40,000 are estimated to be probable interacting candidates, so the absolute number of false positives matters as well. To ensure the quality of the final PPI predictions, the PPV (precision) and specificity of the fivefold cross-validation were calculated and are shown in Figure 6.2. Figure 6.2 A shows that NB outperforms the other classifiers on all GS datasets: only two percent of the negative examples were predicted as potentially interacting by NB. The second-best performance is observed for BN, followed by the SVM classifier. The specificities of NB trained on the four GS datasets lead to observations similar to those from the PPV plot (Figure 6.2 B).
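A short back-of-the-envelope calculation shows why the absolute number of false positives matters at genome scale. The sketch below takes the 4132 E. coli proteins and the roughly 40,000 estimated PPIs quoted in this chapter, and uses a TPR of 0.80 at an FPR of 0.05 (the ROC result reported for the Complex GS) purely as an illustration. It assumes that cross-validation rates carry over unchanged to all protein pairs, which is a simplification; the function name is hypothetical.

```python
def expected_predictions(n_proteins, n_true_ppis, tpr, fpr):
    """Expected true/false positives when every protein pair is classified."""
    total_pairs = n_proteins * (n_proteins - 1) // 2
    negatives = total_pairs - n_true_ppis
    true_pos = tpr * n_true_ppis
    false_pos = fpr * negatives
    ppv = true_pos / (true_pos + false_pos)
    return total_pairs, true_pos, false_pos, ppv

# Illustrative inputs: 4132 proteins, ~40,000 true PPIs, TPR 0.80 at FPR 0.05.
total_pairs, tp, fp, ppv = expected_predictions(4132, 40000, tpr=0.80, fpr=0.05)
```

Under these assumptions there are about 8.5 million possible pairs, and a 5% FPR alone would produce over 420,000 false positives against 32,000 true positives, giving a genome-wide PPV of roughly 7% even though the ROC curve looks excellent.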

Although the performance of the RF classifier measured by ROC is better than that of the other classifiers, the ability to detect negatives is substantially higher for NB. Hence, NB predictions are biologically more meaningful than RF predictions when the total space of possible protein pairs in an organism is considered relative to the estimated number of actual PPIs.

Figure 6.2 Performance accuracy measured as specificity and positive predictive value for seven machine learning classifiers trained on four gold standard datasets. Each bar depicts the average five-fold cross-validation accuracy of a machine learning classifier trained on the corresponding gold standard dataset. The machine learning classifiers are Bayesian Network (BN), Decision Trees (DT), Logistic Regression (LR), Naïve Bayes (NB), Neural Network (NN), Random Forest (RF), and Support Vector Machine (SVM). A) Accuracy measured as the specificity B) Accuracy measured as the positive predictive value. The performance of the NB classifier trained on the four gold standard datasets is the highest among the classifiers. Note that the y-axis starts at 85.

6.3.2 Numbers of PPIs in the networks predicted by various MLCs varied greatly

A total of 28 PPI networks were predicted using the models generated by seven MLCs trained on four datasets. The number of PPIs predicted by each classifier and the overlap between classifiers are shown in Table 6.1. The numbers of predicted PPIs differ substantially. The total number of interactions among the proteins of an organism is difficult, if not impossible, to determine by experimental methods. A range of 16,000 to 37,000 PPIs was estimated for the Yeast proteome of ~6300 proteins by two independent computational analyses. To the best of our knowledge, such an analysis has not been carried out in E. coli. Unlike Yeast, E. coli lacks sub-cellular compartments and has 4132 proteins, which suggests that the number of interacting protein pairs may be in the same range. However, as shown in Table 6.1, 15 of the 28 PPI networks predicted by the seven MLCs using the four gold standard datasets contain more than 100 thousand interactions, whereas the numbers of interactions in the remaining networks range from 44,004 to 87,277.

Table 6.1 Statistics on the numbers of protein-protein interactions predicted by seven Machine Learning Classifiers (MLCs) trained on four different gold standard datasets

DIP                   BN      DT      LR      NB      NN     SVM      RF
BN                225214  142329   85211   76773  104033  162323  112024
DT                     -  178111   83090   74976  102833  160398  114243
LR                     -       -  108399   56412   93632   92254   88439
NB                     -       -       -   78956   59374   76059   61652
NN                     -       -       -       -  163835  130807  134379
SVM                    -       -       -       -       -  190788  151070
RF                     -       -       -       -       -       -  301174

Complex               BN      DT      LR      NB      NN     SVM      RF
BN                102871   64661   41589   51675   57493   49669   63032
DT                     -  242485   52883   39048   87227   71679  152710
LR                     -       -   80506   31997   73669   63791   59846
NB                     -       -       -   57806   43775   42900   42587
NN                     -       -       -       -  156718   92134  100365
SVM                    -       -       -       -       -  106279   84983
RF                     -       -       -       -       -       -  284619

Operon                BN      DT      LR      NB      NN     SVM      RF
BN                 45828   27841   19961   22449   27105   27781   26823
DT                     -  107462   66190   16900   67753   69531   50180
LR                     -       -   80459   16324   56387   50732   50354
NB                     -       -       -   44004   22195   19485   27089
NN                     -       -       -       -   87277   60267   53740
SVM                    -       -       -       -       -   85630   52094
RF                     -       -       -       -       -       -  212703

Complex_Operon        BN      DT      LR      NB      NN     SVM      RF
BN                 69963   46460   31210   35080   32906   48146   37271
DT                     -  158500   51275   35510   65752  110627   68758
LR                     -       -   63881   25291   46238   51636   52985
NB                     -       -       -   53541   28233   39489   33893
NN                     -       -       -       -   75214   60430   55556
SVM                    -       -       -       -       -   84413   68309
RF                     -       -       -       -       -       -  236841

Notes: The MLCs are Bayesian Network (BN), Decision Trees (DT), Logistic Regression (LR), Naïve Bayes (NB), Neural Network (NN), Support Vector Machine (SVM) and Random Forest (RF). DIP, Complex, Operon and Complex_Operon are gold standard protein-protein interaction datasets. The numbers on the diagonal (bold face in the original table) represent the total number of PPIs predicted by the corresponding classifier; the off-diagonal numbers give the overlap between two classifiers. NB predicts the lowest numbers of PPIs, whereas RF predicts the highest.
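The counts in Table 6.1 are enough to quantify how similar two predicted networks are without access to the networks themselves. The sketch below assumes that each off-diagonal entry is the number of PPIs shared by the two classifiers, and derives the size of the union by inclusion-exclusion. The function names are hypothetical; the numbers are the DIP-trained BN and NB entries from the table.

```python
def jaccard_from_counts(size_a, size_b, shared):
    """Jaccard coefficient (0-100) from two network sizes and their overlap."""
    union = size_a + size_b - shared  # inclusion-exclusion
    return 100.0 * shared / union

def fraction_shared(size, shared):
    """Fraction of one network that is also present in the other."""
    return shared / size

# DIP-trained networks from Table 6.1: BN total, NB total, BN/NB overlap.
bn_total, nb_total, bn_nb_shared = 225214, 78956, 76773

jc_bn_nb = jaccard_from_counts(bn_total, nb_total, bn_nb_shared)
nb_in_bn = fraction_shared(nb_total, bn_nb_shared)
```

For this pair the Jaccard coefficient is a little under 34, while about 97% of the NB predictions are also present in the much larger BN network, so on the DIP GS the NB network is close to a subset of the BN network.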

The majority of MLCs predicted more than 100 thousand interactions using the Complex and DIP gold standard datasets. Although the RF classifier performed best on three GS datasets, each of its models predicted more than 200 thousand interactions. These numbers are far higher than the estimated number of PPIs, and the networks might therefore contain many false predictions. On the other hand, NB predicted 78956, 57806, 44004 and 53541 PPIs using the DIP, Complex, Operon and Complex_Operon GS datasets, respectively. These numbers are quite close to the estimated number of PPIs in an organism, and given their small size these networks are likely to contain a high proportion of genuine interactions.

6.3.3 Functional coverage of the predicted networks is very low

The 28 PPI networks were reconstructed using classifiers that gave a fivefold cross-validation PPV of more than 85%. This means that at least 85% of the PPIs predicted using these models/classifiers are expected to be true interactions. The coverage of various known PPIs in these networks is one way to validate the training accuracies of the MLCs. The plots in Figure 6.3 show the coverage of 12 datasets representing known experimental (six sets) and functional (six sets) PPIs in the form of box-and-whisker plots. Boxplots represent differences between groups without making any assumptions about the underlying statistical distribution, and hence they are non-parametric. The spacing between the different parts of the box indicates the degree of dispersion and skewness in the data. Here the groups are the various MLCs, and each box represents the distribution of Jaccard coefficients (JC) between the PPIs predicted by the corresponding MLC and each of the known PPI datasets. JC is a similarity coefficient (scaled between 0 and 100) between two datasets that takes their size difference into account.

Even though the MLCs showed very high sensitivity values, the coverage of known PPIs in the predicted networks is very low (Figure 6.3). The type of PPIs used for training also makes little difference, since the coverage of known experimental (Figure 6.3 A) and functional (Figure 6.3 B) PPIs in the networks predicted using the DIP, Complex and Operon GS is almost the same. Every known PPI dataset is underrepresented in the predicted networks, irrespective of the training gold standard dataset and the MLC used for prediction. The JC, which is scaled between zero and 100, did not rise above 5 for any classifier (Figure 6.3). A JC score of zero here reflects no overlap between two PPI networks, while a score of 100 means complete overlap. This is consistent with previous observations that computational PPI prediction complements the existing knowledge of PPIs [54, 56, 86, 173, 174]. It was also observed that the coverage of known indirect interactions is better than that of direct PPIs in the predicted networks [54]. Here, an indirect interaction is defined as a link between two proteins via their adjacent neighbors in the network rather than a direct interaction. Therefore, the marginal coverage of known PPIs in the networks predicted in the present study is not an artifact but reflects the limited available knowledge of biological systems.

Figure 6.3 Coverage of 12 known experimental and functional protein-protein interaction datasets in the networks predicted by seven machine learning classifiers trained on four different gold standard datasets. Each boxplot depicts the distribution of Jaccard similarity coefficients calculated between the network predicted by the corresponding machine learning classifier and 6 datasets of known physical or functional protein-protein interactions (PPIs). The machine learning classifiers are Bayesian Network (BN), Decision Trees (DT), Logistic Regression (LR), Naïve Bayes (NB), Neural Network (NN), Random Forest (RF), and Support Vector Machine (SVM). A) Coverage of experimentally characterized PPI datasets, which include His-tagged bait, TAP-tagged baits (2 sets), co-presence of proteins in the same protein complex, DIP database and synthetic lethality analysis. B) Coverage of functional PPI datasets, which include protein pairs that belong to the same KEGG pathway, GO term, COG functional category, Operon, EcoCyc functional category and transcriptional regulatory associations. Coverage of experimental as well as functional datasets is higher in the networks predicted by the NB classifier using all gold standard training datasets. Every known PPI dataset is underrepresented in the networks predicted by the different MLCs.

Unexpectedly, the coverage of experimental PPI datasets is relatively higher in the networks predicted by the NB classifier (Figure 6.3 A). These experimental PPI datasets include PPIs derived from His-tagged bait, TAP-tagged baits (2 sets), co-presence of proteins in the same protein complex, the DIP database and synthetic lethality analysis. The coverage of these datasets in the NB networks is substantially higher than in those of the RF classifier, although RF was the best performer in terms of the ROC performance measure on three GS datasets. This reflects a shortcoming of the routinely used ROC performance measure for PPI predictions. As suggested by Park [167], the ROC measure should be complemented with another performance measure that captures the percentage of positives among the predicted PPIs. The present study suggests that PPV and specificity plots can be useful for this purpose. NB performed best when performance was measured as PPV and specificity.

The coverage of the available functional PPI datasets led to similar observations (Figure 6.3 B). The coverage of functional PPI datasets is slightly higher than that of the experimentally derived ones, but all JC scores are still below 5 (Figure 6.3 B). Furthermore, the distributions of JCs differ among the MLCs in a comparable way. Again, the NB-predicted PPI networks have higher coverage of functionally linked proteins than those of the other classifiers except BN (Figure 6.3 B), which also has relatively good coverage of the functional PPI datasets under consideration. A BN represents the probabilistic relationships between data features/variables and the corresponding labels/evidence. Given the data features, a Bayesian network can be used to compute the probabilities of the labels associated with them. NB is a special case of BN with strong (naïve) independence assumptions. Hence, these results suggest that classifiers based on probabilistic models are better predictors of PPIs than the other classifiers used in this study, such as RF and SVM. In the literature, such approaches have indeed been favored over other classifiers for PPI predictions [86, 174-176].

6.3.4 Networks predicted by NB classifier are better than the previous reports

With the very limited knowledge of the total PPIs existing in E. coli, it is challenging to compare these predictions with previously reported computational predictions. Furthermore, not only the methodology but also the GS used for evaluation differs between reports. Hence, a direct comparison of the various approaches used so far is often difficult [87]. In addition, the measured performance of a prediction is often biased towards the GS used for its validation. Considering the combination of five PPI prediction methods (i.e. data features) and multiple MLC models trained on four GS datasets, it was expected that the predicted PPIs would be of high quality with respect to false predictions and the coverage of proteins.

In the light of the aforementioned concerns, it was decided to use three datasets that are not related to MLC training in the present study. At the same time, these datasets are likely to be free from experimental noise. The first two datasets consist of PPIs detected in two independent studies by systematic large-scale tandem-affinity purifications to isolate multiprotein complexes [86, 109, 110]. The experimental setup used by the authors minimizes spurious non-specific protein associations and enables recovery of native protein complexes at near-endogenous levels. The third dataset was created from pairs of proteins that have been associated with each other in at least two experimental analyses. The percentage of each of these datasets that is covered was then calculated for the networks predicted in the present study and for four computational genome-scale PPI networks reported in the literature [54, 81, 82, 86]. Most of the networks predicted by the various MLCs had limited coverage of these known datasets compared with the networks reported in previous studies. Some of these networks had higher coverage, which is likely due to their larger size. Some networks, such as those predicted by NB and BN, had much better coverage of these experimentally derived PPIs than the existing networks. The comparison of these networks is shown in Figure 6.4. These networks are smaller than the existing genome-scale PPI datasets, but their coverage of experimentally derived PPIs is substantially better (Figure 6.4).

Figure 6.4 Comparison of the predicted networks with previously published PPI networks. The x-axis represents computationally predicted genome-scale protein-protein interaction (PPI) networks. Networks marked with an asterisk are predicted in the present study. Each bar represents the percentage of the corresponding experimentally characterized PPI dataset covered by the predicted network. The Exp2 dataset represents PPIs that have been characterized by at least two different experimental analyses. The networks predicted in the present study have substantially high coverage of the three PPI datasets used for evaluation. Only the network predicted by Yellaboina and coworkers is close to the networks predicted in this study. cNB, oNB and coNB stand for networks predicted by models generated by the Naïve Bayes classifier using Complex, Operon and a union of Complex and Operon as the positive gold standard dataset. Similarly, the coBN network is predicted by the Bayesian Network using a union of Complex and Operon as the positive gold standard dataset.

6.4 Summary

The present study describes the prediction of genome-wide PPI networks using seven MLCs trained on four different GS datasets. It was observed that the combination of the confidence scores generated by five PPI prediction methods effectively increased classifier accuracy. The majority of classifiers show high training accuracies, but the predicted networks lack coverage of known PPIs. The NB classifier, a probabilistic model that assumes independence of the data features, recovers substantially higher numbers of known PPIs. These results led me to the conclusion that probability-based classifiers, which assume feature independence, are best suited for PPI predictions in E. coli. The reason behind their better coverage of known PPIs could be that the five PPI prediction methods rest on independent assumptions. Each feature thereby complemented the others and enabled recovery of biologically meaningful interactions when the NB classifier was used.

A combination of five PPI prediction methods, seven MLCs and four GS datasets led to the prediction of 28 networks, each containing no more than 320,000 PPIs. The majority of the networks consist of PPIs in the range of 100,000-200,000. These comparisons have encouraged me to speculate that the total number of PPIs in bacterial cells could be around 150,000, which is much higher than the previous estimates.