
Investigating Machine Learning Based Prediction of...

Date post: 12-Jul-2020
  • Wajid Arshad Abbasi

    2019

    Department of Computer and Information Sciences

    Pakistan Institute of Engineering and Applied Sciences

    Nilore, Islamabad, Pakistan

    Investigating Machine Learning Based

    Prediction of Protein Interactions

  • Reviewers and Examiners

    Foreign Reviewers

    1. Dr. Brian J. Geiss, Associate Professor, Colorado State University (CSU), USA

    2. Prof. Dr. Shihua Zhang, Professor, Chinese Academy of Sciences (CAS), China

    3. Dr. Henri Xhaard, Assistant Professor, University of Helsinki, Finland

    Thesis Examiners

    1. Prof. Dr. Ijaz Mansoor Qureshi, Professor, Air University, Islamabad

    2. Dr. Hammad Naveed, Associate Professor, NUCES, Islamabad

    3. Dr. Imran Amin, Principal Scientist, NIBGE, Faisalabad

    Head of the Department (Name): Dr. Asifullah Khan

    Signature with Date: _________________________________

  • Thesis Submission Approval

    This is to certify that the work contained in this thesis, entitled “Investigating machine

    learning based prediction of protein interactions”, was carried out by Wajid Arshad

    Abbasi, and in my opinion, it is fully adequate, in scope and quality, for the degree of

    Ph.D. Furthermore, it is hereby approved for submission for review and thesis defense.

    Supervisor: ___________________________________

    Name: Dr. Fayyaz ul Amir Afsar Minhas

    Date: 28 March, 2019

    Place: PIEAS, Islamabad.

    Head, Department of Computer and Information Sciences: ___________________

    Name: Dr. Asifullah Khan

    Date: 28 March, 2019

    Place: PIEAS, Islamabad.

  • Investigating Machine Learning Based

    Prediction of Protein Interactions

    Wajid Arshad Abbasi

    Submitted in partial fulfillment of the requirements

    for the degree of Ph.D.

    2019

    Department of Computer and Information Sciences

    Pakistan Institute of Engineering and Applied Sciences

    Nilore, Islamabad, Pakistan

  • ii

    Dedications

    To my grandparents and my uncle Asif Habib Abbasi - who are not in this world

    anymore but continue to live on in my heart. Also, to my parents, my wife Dr. Saiqa

    Andleeb, and my daughter Zarnish Habib, whose love and support have been

    fundamental in completing my thesis.

  • iii

    Author’s Declaration

    I, Wajid Arshad Abbasi, hereby declare that my Ph.D. thesis titled “Investigating

    machine learning based prediction of protein interactions” is my own work and has not

    been submitted previously by me or anybody else for taking any degree from Pakistan

    Institute of Engineering and Applied Sciences (PIEAS) or any other university/institute

    in the country/world.

    At any time if my statement is found to be incorrect (even after my graduation), the

    university has the right to withdraw my Ph.D. degree.

    ____________________

    (Wajid Arshad Abbasi)

    28 March, 2019

    PIEAS, Islamabad.

  • iv

    Plagiarism Undertaking

    I, Wajid Arshad Abbasi, solemnly declare that the research work presented in the thesis

    titled “Investigating machine learning based prediction of protein interactions” is solely

    my research work with no significant contribution from any other person. Any small

    contribution/help, wherever taken, has been duly acknowledged or referred, and the

    complete thesis has been written by me.

    I understand the zero-tolerance policy of the Higher Education Commission (HEC) and

    Pakistan Institute of Engineering and Applied Sciences (PIEAS) towards plagiarism.

    Therefore, I, as the author of the thesis titled above, declare that no portion of my thesis

    has been plagiarized and any material used as a reference is properly referred/cited.

    I undertake that if I am found guilty of any formal plagiarism in the thesis titled above

    even after the award of my Ph.D. degree, PIEAS reserves the right to withdraw/revoke

    my Ph.D. degree and that HEC and PIEAS have the right to publish my name on the

    HEC / PIEAS website on which the names of students who submitted plagiarized

    theses are placed.

    ____________________

    (Wajid Arshad Abbasi)

    28 March, 2019

    PIEAS, Islamabad.

  • v

    Copyrights Statement

    The entire contents of this thesis entitled Investigating Machine Learning Based

    Prediction of Protein Interactions by Wajid Arshad Abbasi are the intellectual

    property of Pakistan Institute of Engineering & Applied Sciences (PIEAS). No portion

    of the thesis should be reproduced without obtaining explicit permission from PIEAS.

  • vi

    Table of Contents

    Dedications......................................................................................................... ii

    Author’s Declaration ........................................................................................ iii

    Plagiarism Undertaking..................................................................................... iv

    Copyrights Statement ......................................................................................... v

    Table of Contents .............................................................................................. vi

    List of Figures .................................................................................................. xii

    List of Tables .................................................................................................. xix

    Acknowledgments ........................................................................................... xxi

    Abstract ........................................................................................................ xxiii

    List of Publications and Patents ..................................................................... xxv

    List of Abbreviations and Symbols ............................................................... xxvi

    1 Introduction ............................................................................................... 1

    1.1 Motivations........................................................................................... 3

    1.2 Problem Statement and Research Aims ............................................... 4

    1.3 Dissertation Organization and Chapters’ Digest .................................. 5

    2 Problem Formulation and Literature Survey ......................................... 8

    2.1 Proteins ................................................................................................. 8

    2.1.1 Protein Structures ........................................................................... 9

    2.1.2 Protein Functions .......................................................................... 10

    2.2 Protein Interactions and Complex Formation .................................... 12

    2.2.1 Binding Affinity of Interacting Proteins ...................................... 13

    2.2.2 Interfaces or Interaction Sites of Proteins .................................... 13

    2.2.3 Types of Protein Interactions and Complexes .............................. 14

    2.2.4 Biologically Significant Effects of Protein Interactions ............... 15

    2.2.5 Problems of Interest in Protein Interactions ................................. 15

    2.3 Experimental Methods ....................................................................... 16

    2.4 Computational Methods ..................................................................... 17

    2.4.1 Classical Computational Methods ................................................ 18

    2.4.2 Machine Learning......................................................................... 21

    2.4.2.1 Protein Interaction Prediction .................................................... 21

    2.4.2.2 Protein Binding Affinity Prediction .......................................... 25

    2.4.2.3 Protein Interface or Interaction Site Prediction ......................... 26

    3 Issues in Host-Pathogen Protein Interaction Prediction ...................... 30

    3.1 Methods .............................................................................................. 33

    3.1.1 Datasets and Preprocessing .......................................................... 33

    3.1.1.1 Human-HIV Interaction Dataset (HH) ...................................... 34

    3.1.1.2 Human-Adenovirus Interaction Dataset (HA) .......................... 34

    3.1.2 Classifiers ..................................................................................... 34

    3.1.3 Feature Extraction ........................................................................ 36

    3.1.4 Model Evaluation ......................................................................... 37

    3.1.5 Performance Metrics .................................................................... 38

    3.2 Results and Discussion ....................................................................... 40

    3.2.1 Analysis of Evaluation Methodologies ........................................ 40

    3.2.2 Metrics for HPI Prediction ........................................................... 43

    3.3 Chapter Summary ............................................................................... 45

    4 CaMELS: Calmodulin Interaction Learning System .......................... 48

    4.1 Methods .............................................................................................. 49

    4.1.1 Dataset and Preprocessing ............................................................ 49

    4.1.1.1 CaM Interaction Site Dataset .................................................... 50

    4.1.1.2 CaM Interaction Dataset ............................................................ 50

    4.1.2 Machine Learning Models............................................................ 51

    4.1.2.1 MIL Based CaM-Interaction Site Prediction ............................. 51

    4.1.2.3 Interaction Prediction ................................................................ 55

    4.1.3 Feature Extraction ........................................................................ 56

    4.1.3.1 Window Level Feature Representation ..................................... 56

    4.1.3.2 Protein Level Feature Representation ....................................... 58

    4.1.4 Performance Evaluation ............................................................... 59

    4.1.4.1 Evaluation of Interaction Prediction.......................................... 59

    4.1.4.2 Evaluation of Interaction Site Prediction .................................. 62

    4.1.5 Model Selection ............................................................................ 63

    4.1.6 Webserver ..................................................................................... 63

    4.2 Results and Discussion ....................................................................... 64

    4.2.1 Interaction Prediction ................................................................... 64

    4.2.1.1 Improved CaM Interaction Prediction ....................................... 66

    4.2.1.2 Motifs Search Fails to Predict CaM Interactions ...................... 68

    4.2.1.3 Importance of the Whole Protein Sequence .............................. 68

    4.2.1.4 GO Term Enrichment Analysis ................................................. 68

    4.2.1.5 Performance Evaluation on Validation Set ............................... 69

    4.2.1.6 In Silico Mutation Analysis ....................................................... 70

    4.2.1.7 Validation Through Wet-Lab Experiments ............................... 70

    4.2.1.8 Feature Analysis ........................................................................ 70

    4.2.2 Interaction Site Prediction ............................................................ 71

    4.2.2.1 Improved CaM Interaction Site Prediction ............................... 72

    4.2.2.2 Motifs Search Fails to Predict CaM Interaction Site ................. 73

    4.2.2.3 Performance Evaluation on Validation Set ............................... 74

    4.2.2.4 Validation Through Wet-Lab Experiments ............................... 76

    4.2.2.5 Contribution of Amino Acids and Motifs Identification ........... 76

    4.2.2.6 MIL Using SSGO Method ........................................................ 78

    4.2.2.7 Analysis of Features in Interaction Site Prediction ................... 78

    4.3 Chapter Summary ............................................................................... 78

    5 ISLAND: In-Silico Protein Affinity Predictor ...................................... 80

    5.1 Methods .............................................................................................. 81

    5.1.1 Datasets and Preprocessing .......................................................... 82

    5.1.2 Evaluation of the PPA-Pred2 Webserver ..................................... 82

    5.1.3 Sequence Homology as Affinity Predictor ................................... 82

    5.1.4 Proposed Methodology................................................................. 83

    5.1.5 Sequence-Based Features ............................................................. 83

    5.1.5.1 Explicit Features ........................................................................ 83

    5.1.5.2 Kernel Representations.............................................................. 84

    5.1.6 Complex Level Features Representation ...................................... 85

    5.1.6.1 Feature Concatenation ............................................................... 86

    5.1.6.2 Combining Kernels.................................................................... 86

    5.1.7 Regression Models ....................................................................... 87

    5.1.7.1 Ordinary Least-Squares Regression (OLSR) ............................ 87

    5.1.7.2 Support Vector Regression (SVR) ............................................ 87

    5.1.7.3 Random Forest Regression (RFR) ............................................ 88

    5.1.8 Model Validation and Performance Assessment.......................... 88

    5.1.9 Webserver ..................................................................................... 88

    5.2 Results and Discussion ....................................................................... 89

    5.2.1 Binding Affinity Prediction Through Sequence Homology ........ 89

    5.2.2 Binding Affinity Prediction Through ISLAND ........................... 89

    5.2.3 Comparison Using External Independent Test Dataset ................ 90

    5.3 Chapter Summary ............................................................................... 91

    6 Learning Protein Binding Affinity Using Privileged Information ...... 93

    6.1 Methods .............................................................................................. 94

    6.1.1 Datasets and Preprocessing .......................................................... 94

    6.1.2 Proposed Approach ...................................................................... 95

    6.1.2.1 Baseline Classifiers ................................................................... 96

    6.1.2.2 LUPI-SVM ................................................................................ 97

    6.1.3 Feature Representation ................................................................. 99

    6.1.3.1 Sequence-Based Features ........................................................ 100

    6.1.3.2 Structure-Based Features ......................................................... 100

    6.1.4 Model Validation, Selection and Performance Assessment ....... 102

    6.1.5 Webserver ................................................................................... 103

    6.2 Results and Discussion ..................................................................... 104

    6.2.1 Performance of Baseline Learners ............................................. 104

    6.2.2 Performance of LUPI-SVM ....................................................... 105

    6.2.3 Evaluation Through Validation Dataset ..................................... 107

    6.2.4 Feature Analysis for Binding Affinity Prediction ...................... 108

    6.2.5 Learned Models Using LUPI and Classical SVM...................... 109

    6.3 Chapter Summary ............................................................................. 109

    7 PAIRpred: A Webserver for Protein Interface Prediction ............... 111

    7.1 Implementation................................................................................. 111

    7.2 Usage ................................................................................................ 113

    7.3 Results .............................................................................................. 114

    7.4 Validation Through Wet-Lab Experiments ...................................... 115

    8 Conclusions and Future Work ............................................................. 117

    8.1 Conclusions ...................................................................................... 117

    8.2 Future Work ..................................................................................... 119

    8.2.1 Application of Learning Using Privileged Information ............. 120

    8.2.2 Handling Data Sparsity in Protein Interaction Domain.............. 120

    Appendix A: Predictions Through CaMELS ............................................ 122

    References ..................................................................................................... 128

  • xii

    List of Figures

    Figure 1.1 Central Dogma of Molecular Biology. Portion of DNA called

    a gene is transcribed to RNA which is used as a template to

    synthesize proteins during translation ................................... 1

    Figure 2.1 The chemistry of an amino acid (left panel) and properties of

    side chain (Right panel). Every amino acid has a carbon

    atom, called an alpha carbon (Cα), bonded to a carboxylic

    acid (–COOH) group, an amine (-NH2) group, a hydrogen

    atom, and an R group (side chain) that is unique for every

    amino acid. Physicochemical properties of an amino acid are

    determined by the nature of its side chain .............................. 8

    Figure 2.2 Different levels of protein structure. Amino acids are

    joined together in various combinations through covalent

    bonds to form the primary structure. Different sections of the

    primary structure fold through backbone hydrogen

    bonding to form alpha helices and beta sheets. Elements of the

    secondary structure fold again through side chain interactions

    to form the tertiary structure, stabilized by ionic bonds, disulfide

    bonds, hydrophobic interactions, and hydrogen bonding.

    Protein quaternary structures are formed through interaction

    or binding of two or more independent tertiary structures ..... 9

    Figure 2.3 Protein Functions. Proteins perform their functions as

    enzymes (Sucrase), antibodies (T-cell receptor), messengers

    (Insulin), or structural components (Actin). The most

    fundamental function that proteins perform, and which

    underpins all their other biochemical functions, is their ability to

    bind or interact with other proteins or macromolecules .......... 11

    Figure 2.4 Protein Interaction. Two unbound proteins (Ligand and

    Receptor) with complementarity in shape and charge

    distribution interact with each other to form a protein

    complex. Interface of the complex at 6Å distance threshold

    is shown with sticks in magenta color .................................... 12

    Figure 2.5 Types of protein interactions and complexes. Protein

    complexes are homomeric if only one type of protein chain is

    involved in the interaction, and heteromeric if different types

    of protein chains are involved in complex formation.

    Protein complexes are further

    divided into stable or transient based on the duration of the

    interaction. Binding affinity is a measure of the strength of

    interaction between the proteins involved in complex

    formation. It is measured in terms of the

    dissociation constant (𝐾𝑑) and is high for

    low 𝐾𝑑 values. Stable complexes have high binding affinity, while weak

    transient complexes have low binding affinity ........................ 14

    Figure 2.6 Experimental methods to determine protein interactions,

    binding affinity, and interaction site or interface. .................. 17

    Figure 2.7 Classical Computational methods to predict protein

    interactions, binding affinity, and interaction site or interface

    of a protein complex. .............................................................. 18

    Figure 2.8 Classical Computational methods for protein interaction,

    binding affinity and interface prediction. (a) Interolog search;

    (b) Docking. ........................................................................... 19

    Figure 2.9 A general framework for developing machine learning

    models for PPIs, binding affinity and interface prediction. ... 22

    Figure 2.10 Machine learning methods for protein interactions, binding

    affinity, and interface or interaction site prediction. .............. 27

    Figure 3.1 A general framework of machine learning models used to

    predict the host-pathogen protein interactions (HPIs)............ 31

    Figure 3.2 A comparison of two different cross-validation (CV)

    schemes on a toy dataset: K-fold (left panel) and

    Leave One Pathogen Protein Out (LOPO) (right

    panel). In both evaluation protocols, the number of folds is equal

    to the number of pathogen proteins in the toy dataset. In K-fold

    CV, folds are created randomly, while in LOPO, folds are

    created with respect to pathogen proteins. With K-fold CV, data

    overlap occurs for both host and pathogen proteins,

    e.g., proteins 𝑝1 and ℎ1 occur in both train and test sets in each

    fold, whereas with LOPO CV the overlap vanishes with

    respect to pathogen proteins ................................................... 33

    Figure 3.3 Precision-recall curves obtained through K-fold and LOPO

    cross-validation. (a-d) Human-HIV and (e-h) Human-

    Adenovirus interaction datasets. Mean area under the curves

    across folds along with standard deviation is shown in

    parentheses .............................................................................. 41

    Figure 3.4 Receiver operating characteristic (ROC) curves obtained

    through K-fold and LOPO cross-validation. (a-d) Human-

    HIV and (e-h) Human-Adenovirus interaction datasets. Mean

    area under the curves across folds along with standard

    deviation is shown in parentheses ........................................... 43

    Figure 3.5 Radar plots of the area under the ROC curve (AUC-ROC)

    using two different cross-validation schemes for all models . 45

    Figure 3.6 Radar plots of the area under the precision-recall curves

    (AUC-PR) using two different cross-validation schemes for

    all models ............................................................................... 46

    Figure 4.1 MIL Framework for CaM interaction site prediction. The

    protein sequence 𝑝 is represented as a line while the

    annotated CaM interaction site as a box. All windows overlapping

    the annotated interaction site in 𝑝 constitute

    positive examples (𝐵𝑝) and the rest of the windows constitute

    negative examples (𝑁𝑝). The score obtained from the trained

    discriminant function 𝑓(𝑥) should be higher for at least one

    positive example than the scores generated for all negative

    examples in 𝑝 ......................................................................... 50

    Figure 4.2 MIL training algorithm with SSGO for CaM interaction site

    prediction ........................................................................................ 53

    Figure 4.3 The online user interface for CaMELS webserver. (a) This

    webserver accepts a FASTA file or the plain sequence of a protein

    for CaM interaction and interaction site prediction; (b)

    Interaction prediction model; (c) Interaction site prediction

    model ...................................................................................... 64

    Figure 4.4 (a) Receiver Operating Characteristic (ROC); (b) Precision-

    recall (PR) curves for CaM interaction prediction for all

    models. The averaged area under the curve across folds is

    shown in parentheses .............................................................. 65

    Figure 4.5 (a) Precision-recall (PR) curves showing a comparison of

    CaMELS with MI-1 and iLoops. The averaged area under the

    curve across folds is shown in parentheses; (b) Violin plot

    showing density distributions of scores for positive (CaM

    interacting) and negative (non-interacting) proteins

    generated through DFS and CaMELS. Dotted lines show

    density quartiles ..................................................................... 66

    Figure 4.6 (a) Receiver Operating Characteristic (ROC) curves; (b)

    Precision-recall (PR) curves; (c) ROC0.1 curves; (d) RFPP

    curves for CaM interaction site prediction across different

    models. The averaged area under the curves across folds is

    shown in parentheses .............................................................. 71

    Figure 4.7 Predicted interaction sites of complexes of proteins with

    CaM in the validation dataset through CaMELS. Calmodulin

    (CaM) (grey with light shade); CaM interaction protein (grey

    with dark shade); The predicted central residue of the

    interaction site (sphere); Residues of the CaM interacting

    protein within 5Å of CaM (stick form). (a) PDB ID: 1NWD;

    (b) PDB ID: 1SY9; (c) PDB ID: 2M0K; (d) PDB ID: 5DOW;

    (e) PDB ID: 1YRT ................................................................. 74

    Figure 4.8 Interaction site prediction score through CaMELS for

    proteins used in mutagenic studies. Location of the predicted

    interaction site has been denoted with a red dot. (a) LCa; (b)

    SGS3 of Nicotiana benthamiana ........................................... 75

    Figure 4.9 Learned weight vectors of classifiers during training of

    CaMELS. (a) Weights obtained during training using AAC

    feature representation; (b) Heat map of the weights obtained

    during training using PDC feature representation; (c) Top 50

    motifs learned during training using PDGT feature

    representation. Actual weight value learned for each feature

    during training is shown in the numeric column .................... 77

    Figure 5.1 A general framework for protein affinity prediction using

    machine learning techniques .................................................. 81

    Figure 5.2 Techniques adopted for generating sequence-based feature

    representation of a protein complex for developing machine

    learning based protein binding affinity prediction models .... 86

    Figure 5.3 The online user interface for ISLAND webserver. A user can

    submit a pair of plain sequences of proteins of interest for

    binding affinity prediction ...................................................... 89

    Figure 5.4 Cumulative histogram of absolute error between actual and

    predicted binding affinity values through ISLAND and PPA-

    Pred2 on external independent validation dataset .................. 91

    Figure 6.1 A framework to classify protein complexes based on their

    binding affinities through the paradigm of learning using

    privileged information (LUPI). Privileged information (3D

    structural information) is only required at training time (left

    panel) to improve performance at test time (right panel),

    which uses sequence information alone ......................................... 94

    Figure 6.2 Learning using privileged information with stochastic sub-

    gradient optimization training ................................................ 99

    Figure 6.3 Number of interacting residue pairs (NIRP) in the interface

    of a protein complex. The frequency of non-repeating pairs

    (considering A:B and B:A as the same) was computed from the

    bound 3D structures of ligand (L) and receptor (R) of a

    protein complex. Residues (shown as spheres) at a distance

    cutoff of 8 angstroms (Å) are considered the interface of the

    complex. The bottom panel of the figure shows the form of

    the feature vector extracted through this scheme ......................... 101

    Figure 6.4 The online user interface for LUPI-SVM webserver. (a) A

    user can submit a pair of plain sequences of proteins of interest

    for binding affinity prediction; (b) An elucidation of

    predicted score ....................................................................... 103

    Figure 6.5 (a) ROC and (b) PR curves showing a performance

    comparison between LUPI-SVM (with 2-mer as input and

    Moal Descriptors as privileged feature space) and the

    baseline classifiers (XGBoost, classical SVM (SVM), and

    Random Forest (RF) with 2-mer features) on the affinity

    benchmark dataset. The average area under the ROC and PR

    curve (AUC) is shown in parentheses ..................................... 104

    Figure 6.6 Feature analysis using SHAP. The impact of 2-mer features

    on model output is shown using SHAP values. The plot

    shows the top 20 2-mers for the Ligand (L) or Receptor (R)

    by the sum of their SHAP values over all samples. Feature

    value is shown in color (Red: High; Blue: Low), revealing, for

    example, that a high value of L (EK) (count of ‘EK’ in a

    protein sequence designated as a ligand) contributes more to

    predicting low binding affinity complexes ............................ 108

    Figure 6.7 Weight vectors of the trained classifiers for the ligand

    Blosum features. (a) SVM with LUPI framework using

    Blosum (Protein) as input and Moal Descriptors as privileged

    feature space; (b) Classical SVM using Blosum (Protein)

    features ................................................................................... 109

    Figure 7.1 Flowchart of the PAIRpred webserver. PAIRpred takes a pair

    of proteins in PDB or FASTA format. Upon successful

    format validation, PAIRpred performs chain selection and

    feature extraction from the given sequences or structures.

    Extracted features are used to generate predictions from a

    pre-trained SVM classifier. These prediction results are

    available for download and as an email attachment. .............. 112

    Figure 7.2 Web interface of the PAIRpred webserver. (a) Home page

    with user input and files upload options; (b) Chain selection;

    (c) job submission notification and view results options

    Figure 7.3 Input PDB files with modified B factors. B factors

    of the ligand PDB file are replaced with 'Ligand scores' and

    those of the receptor PDB file with 'Receptor scores'

    Figure 8.1 Machine learning techniques to handle sparsity in labeled

    training data. SMEs: Subject Matter Experts; LUPI: Learning

    Using Privileged Information; MIL: Multiple Instance

    Learning; GANs: Generative Adversarial Networks

    List of Tables

    Table 3.1 Proposed biologist centric metrics to assess the generalization

    performance of HPIs predictors over LOPO cross-validation

    across all models

    Table 4.1 Results showing the performance of CaMELS in comparison

    to DFS for CaM interaction prediction for all models

    Table 4.2 Results showing performance of CaMELS and MI-1 via Gene

    Ontology term enrichment analysis

    Table 4.3 Results showing the performance of CaMELS for CaM

    interaction site prediction in comparison to SVM (baseline),

    mi-SVM and MI-1 across all models. AUCPR was unavailable

    for SVM (baseline), mi-SVM and MI-1

    Table 5.1 Evaluation of PPA-Pred2 through its webserver on affinity

    benchmark dataset 2.0

    Table 6.1 Protein complex classification results obtained using classical

    SVM, Random Forest and XGBoost using input and privileged

    features with LOCO cross-validation over the affinity

    benchmark dataset

    Table 6.2 Protein complex classification results obtained through

    classical SVM and LUPI across different features using LOCO

    cross-validation over the affinity benchmark dataset

    Table 6.3 Comparison of classical SVM and LUPI-SVM on the external

    independent validation dataset with training on affinity

    benchmark dataset

    Table A1 Top 241 predicted CaM binding proteins from the proteome of

    A. thaliana through CaMELS along with their predicted

    interaction sites

    Table A2 241 CaM binders from interaction dataset along with predicted

    binding sites through CaMELS

    Table A3 List of 250 proteins used as negative set in the independent

    validation dataset

    Acknowledgments

    After humbly thanking Allah Almighty, I want to express my gratitude to those who

    helped me to conduct this research work and enabled me to complete my Ph.D. First, I

    am very thankful to my adviser, Dr. Fayyaz Ul Amir Afsar Minhas, for his time,

    devotion, encouragement, motivation, guidance, support, and valuable discussions

    that helped me to define and execute my thesis research. During the four years of my

    Ph.D., we spent a significant amount of time during meetings, discussions,

    brainstorming and presentation sessions. He always tried to deliver every bit of

    knowledge to me and I always found myself much more relaxed, focused and motivated

    after every meeting with him. I hope he will remain my mentor and guide for the rest of

    my life.

    I am also grateful to the members of my research committee: Dr. Sikander Majid

    Mirza and Dr. Asifullah Khan for their feedback on the research proposal. I also want

    to acknowledge the support and guidelines of our collaborators Prof. Dr. Asa Ben-Hur,

    Colorado State University, USA and Dr. Imran Amin, Principal Scientist, NIBGE. I am

    also thankful to other faculty members of the Department of Computer and Information

    Sciences, PIEAS: Dr. Abdul Jalil, Dr. Mutawarra Hussain, Dr. Anila Usman, Dr. Abid

    Mughal, Dr. Javaid Khurshid, Dr. Naeem Akhtar, Dr. Shahzad Ahmad Qureshi for their

    support, and especially to Dr. Muhammad Hanif Durad for providing me with a seat in

    his lab. I also want to extend my sincere gratitude to my lab fellows: Amina Asif, Sadaf

    Khan, Adiba Yaseen, Kanza Hamid, Abdul Hanan Basit, Bismillah Jan, Fahad ul

    Hassan, Asif Khan, Muhammad Dawood and Hira Kamal for their support and help. I

    will never forget those wonderful hikes and parties which I had with them during my

    Ph.D. studies. I am also indebted to my other friends at PIEAS: Muhammad Imran,

    Naveed Akhtar, Mohsin Sittar, Naveed Chohan, Noorul Wahab, Faheem Afsar,

    Muhammad Bashir, and Mirqad Ayaz for their cooperation and moral support. If I

    missed your name on this list and you think it belongs here, I apologize.

    I would also like to express my gratitude towards my family, especially my

    parents (Arshad Habib Abbasi and Safia Shaheen), Uncles (Arif Habib, Sardar Imtiaz

    Abbasi, Tariq Habib, M. Shafiq Abbasi and Abid), Aunties (Razia Shaheen, Shaheen,

    Balqees and Razia Sultan), brothers (Amjid, Badar Munir, Nayyer, Waseem, Asad Ali,

    Zohaib, Umer, Rizwan, Shujahat Ali and Abdullah) and sisters (Fozia Arshad, Qudsia,

    Fozia Aziz, Kiren, Uzma, Rubi, Faiza, Maryam and Nida) for their care, love and good

    wishes. I gratefully acknowledge the support, encouragement, care, love, and

    patience of my beloved wife, Dr. Saiqa Andleeb. Without her support, it would have been

    impossible for me to complete my doctorate. Also, thanks to my daughter Zarnish

    Habib, and to my niece and nephews (Ahmed, Ahsan, Ayan Mansoor, Mahrosh Habib, Zain, Esa,

    Aryan Habib, Hashim Habib, and Mohid Habib); all of you are the reason for me to keep

    going.

    I must also extend my thanks to Muhammad Sadique Awan, Shabir Ahmed

    Abbasi and Imran Abbasi at the University of Azad Jammu and Kashmir for their

    support in official matters.

    Lastly, I would like to acknowledge the Higher Education Commission (HEC)

    of Pakistan for funding my Ph.D. studies via a grant (PIN: 213-58990-2PS2-046) under

    the Indigenous 5000 Ph.D. Fellowship scheme. I am also thankful to the HEC for providing funds

    under the International Research Support Initiative Program (IRSIP) to pursue my

    Ph.D. research work at Colorado State University (CSU), USA. I am grateful for this

    scholarship primarily because it provided me with an opportunity to expand

    my horizons and knowledge.

    Wajid Arshad Abbasi

    Abstract

    Protein interactions are crucial in the cell for performing cellular functions and the study

    of protein interactions is a very important domain of research in bioinformatics. In

    reference to protein interactions, biologists are usually interested in three core

    problems: determining pairwise protein interactions, determining binding affinity,

    and identifying the interface. Computational methods to solve these protein

    interaction problems have emerged as an active research area because experimental

    procedures are tedious, costly, and time-consuming. Our aim in this work is to develop novel

    machine learning based methods for protein interaction, binding affinity, and interaction

    site prediction with improved generalization performance.

    In this dissertation, we have developed host-pathogen protein interaction predictors

    using machine learning. One of our findings is that existing methods for protein

    interaction prediction that use K-fold cross-validation for performance assessment

    report over-estimated accuracy values as K-fold cross-validation does not take pairwise

    protein similarity between training and test examples into account. To control this data

    redundancy at pathogen protein level, we have proposed and advocated the use of an

    alternate evaluation scheme called Leave One Pathogen Protein Out (LOPO) cross-

    validation along with some biologist centric metrics for designing protein-protein

    interaction prediction methods.

    We have also designed a novel machine learning model called CaMELS (CalModulin

    intEraction Learning System) for interaction and interaction site prediction of

    Calmodulin (CaM), which is a very important and highly conserved protein across all

    eukaryotes. CaMELS relies on a novel implementation of a multiple instance learning

    solver for protein binding site prediction that leads to significant improvement in

    predictive performance. One of our collaborators has confirmed the effectiveness of

    CaMELS through wet-lab experiments as well.

    We have also focused on the more generic problem of predicting binding affinity in

    protein interactions and presented various sequence-based machine learning models.

    For this purpose, we have developed a novel machine learning method which is based

    on the framework of Learning Using Privileged Information (LUPI). Our state-of-the-

    art method uses protein 3D structure as privileged information at training time while

    expecting only protein sequence information during testing. This makes the method

    applicable to proteins whose 3D structures have not been solved.

    We have also developed a webserver for an existing state-of-the-art protein-protein

    interface prediction method called PAIRpred. The accuracy of this webserver has

    been validated by our collaborators through wet-lab experiments as well.

    List of Publications and Patents

    Journal Publications

    Wajid Arshad Abbasi, Amina Asif, Asa Ben-Hur and Fayyaz ul Amir Afsar

    Minhas, “Learning Protein Binding Affinity using Privileged Information”,

    BMC Bioinformatics, vol. 19, 425, 2018.

    Abdul Hanan Basit, Wajid Arshad Abbasi, Amina Asif and Fayyaz ul Amir

    Afsar Minhas, “Training Large Margin Host-Pathogen Protein-Protein

    Interaction Predictors”, Journal of Bioinformatics and Computational Biology,

    vol. 16, 18500142, 2018.

    Wajid Arshad Abbasi, Amina Asif, Saiqa Andleeb and Fayyaz ul Amir Afsar

    Minhas, “CaMELS: In silico prediction of calmodulin binding proteins and their

    binding sites”, Proteins: Structure, Function and Bioinformatics, vol. 85 (9), pp.

    1724–1740, 2017.

    Wajid Arshad Abbasi and Fayyaz ul Amir Afsar Minhas, “Issues in

    performance evaluation for host–pathogen protein interaction prediction”,

    Journal of Bioinformatics and Computational Biology, vol. 14 (3), 1650011,

    2016.

    Conference Publications

    Adiba Yaseen, Wajid Arshad Abbasi and Fayyaz ul Amir Afsar Minhas,

    “Protein binding affinity prediction using support vector regression and

    interfecial features”, 15th International Bhurban Conference on Applied

    Sciences and Technology (IBCAST), IEEE, 2018, pp. 194-198.

    Kanza Hamid, Amina Asif, Wajid Arshad Abbasi, Durre Sabih and Fayyaz ul

    Amir Afsar Minhas, “Machine Learning with Abstention for Automated Liver

    Disease Diagnosis”, in Proceedings of the 15th International Conference on

    Frontiers of Information Technology, IEEE, 2017, pp. 356-361.

    Preprints

    Wajid Arshad Abbasi, Fahad Ul Hassan, Adiba Yaseen, Fayyaz Ul Amir Afsar Minhas. “ISLAND: In-Silico Prediction of Proteins Binding Affinity

    Using Sequence Descriptors”, arXiv:1711.0540.

    Amina Asif, Wajid Arshad Abbasi, Farzeen Munir, Asa Ben-Hur, and Fayyaz ul Amir Afsar Minhas. “pyLEMMINGS: Large Margin Multiple

    Instance Classification and Ranking for Bioinformatics Applications”,

    arXiv:1711.04913.

    List of Abbreviations and Symbols

    𝑷𝒓 Pearson Correlation Coefficient

    3-D Three Dimensional

    AAC Amino Acid Composition

    AUC Area under the ROC Curve

    AUC-PR Area Under the Precision-Recall Curve

    AUC-ROC Area Under the ROC Curve

    BIP-BIANA Biologic Interactions and Network Analysis

    BLOSUM Blocks Substitution Matrix

    CaM Calmodulin

    CaMELS Calmodulin Interaction Learning System

    CV Cross-Validation

    DFS Discriminant Function Scoring

    DNA Deoxyribonucleic Acid

    FHR False Hit Rate

    FP Fluorescence Polarization

    GO Gene Ontology

    HIV Human Immunodeficiency Virus

    HPIs Host-Pathogen Interactions

    HTS High-Throughput Sequencing

    IR Insulin Receptor Tyrosine Kinase

    ISLAND In-Silico Protein Affinity Predictor

    ITC Isothermal Titration Calorimetry

    JBCB Journal of Bioinformatics and Computational Biology

    LOCO Leave One Complex Out

    LOPO Leave One Pathogen Protein Out

    LUPI Learning Using Privileged Information

    MDS Molecular Dynamics Simulation

    MIL Multiple Instance Learning

    MRFPP Median Rank of the First Positive Prediction

    NCBI National Center for Biotechnology Information

    NIRP Number of Interacting Residue Pairs

    NMR Nuclear Magnetic Resonance

    OLSR Ordinary Least-Squares Regression

    PAIRpred Partner Aware Interacting Residue Predictor

    PD-Blosum Position Dependent BLOSUM-62

    PDC Position Dependent Composition

    PDGT Position Dependent Gappy Triplet

    PHISTO Pathogen-Host Interaction Search Tool

    PPIs Protein-Protein Interactions

    PR Area Under the Precision-Recall Curve

    PseAAC Pseudo-Amino Acid Compositions

    PSFMs Position Specific Frequency Matrices

    PSSMs Position Specific Scoring Matrices

    RB Retinoblastoma

    RF Random Forest

    RFPP Rank of the First Positive Prediction

    RFR Random Forest Regression

    RMSE Root Mean Squared Error

    RNA Ribonucleic Acid

    ROC Receiver Operating Characteristic

    SPR Surface Plasmon Resonance

    SSGO Stochastic Sub-Gradient Optimization

    SVM Support Vector Machines

    SVR Support Vector Regression

    TAP Tandem Affinity Purification

    THR True Hit Rate

    UniProt Universal Protein Knowledgebase

    WHO World Health Organization

    XGBoost Extreme Gradient Boosting

    Y2H Yeast Two-Hybrid

    1 Introduction

    To better understand the complexity of life and the mechanisms of biological systems,

    we need to analyze the dynamic interactions of these systems at the molecular level

    [1]. In all living organisms, the cell is the fundamental unit of life and it is composed of

    different molecules which perform all life-sustaining functions [2]. There are three

    molecules in a cell that are primarily responsible for sustaining life: deoxyribonucleic acid

    (DNA), ribonucleic acid (RNA), and proteins. These three molecules function under the

    principle of the Central Dogma of Molecular Biology [3]. This dogma operates in two steps:

    first, a portion of DNA called a gene is copied to form a messenger RNA (transcription)

    and then the messenger RNA is used as a template to synthesize proteins (translation) as

    shown in Fig. 1.1. Proteins are the key molecules that perform almost all biologically

    significant functions at the cellular level.

    Figure 1.1. Central Dogma of Molecular Biology. A portion of DNA called a

    gene is transcribed to RNA which is used as a template to synthesize proteins

    during translation.
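
    The two steps of the Central Dogma can be illustrated with a minimal Python sketch. It assumes the input is the coding (sense) strand of a gene, so that transcription reduces to replacing T with U, and it uses only a small, partial codon table chosen for this example. The function names `transcribe` and `translate` are illustrative and are not part of this work.

```python
# Partial codon table for illustration only (the full genetic code has 64 codons).
CODON_TABLE = {
    "AUG": "M", "UUU": "F", "GGC": "G", "GAA": "E", "AAA": "K",
    "UAA": "*", "UAG": "*", "UGA": "*",  # stop codons
}


def transcribe(coding_dna):
    """Step 1 (transcription): coding-strand DNA to messenger RNA (T -> U)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna):
    """Step 2 (translation): read codons in frame until a stop codon or the end."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        residue = CODON_TABLE[mrna[i:i + 3]]
        if residue == "*":
            break
        protein.append(residue)
    return "".join(protein)


mrna = transcribe("ATGGAAAAATTTGGCTAA")
protein = translate(mrna)
print(mrna, "->", protein)
```

    The resulting string of one-letter amino acid codes is the kind of sequence that the predictors discussed in the following chapters take as input.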

    Functionally, proteins are the dominant players and the second most abundant

    biomolecules in the cell after water. The importance of proteins in our body can be

    appreciated by considering the fact that 50% of the dry weight of the human body is protein

    [4]. Proteins perform their functions in many different forms: pepsin aids digestion as

    an enzyme, insulin controls blood sugar as a hormone, calmodulin (CaM) affects

    intracellular signaling, hemoglobin transports oxygen, histones play a role in gene

    regulation, and antibodies combat infectious diseases, among many others [5]. Considering this

    huge functional diversity of proteins, it is important to decipher their working mechanisms

    in order to completely understand cellular behavior.

    Up to the 1970s, it was widely believed that a single protein performs a single

    function under a dogma called ‘one gene/one enzyme/one function’ and protein

    interactions were considered as purification artifacts [6], [7]. The idea of single protein-

    single function was ultimately shown to be incorrect with the discovery of the involvement

    of multiple proteins in DNA replication (e.g., DNA helicase, DNA primase) besides the

    polymerase and the participation of more than 20 proteins in the protein import into

    mitochondria [7]–[9]. Now, it is an established fact that proteins do not function in isolation

    and more than 80% of all cellular proteins perform their biological functions by forming

    complexes through protein-protein interactions (PPIs) [10], [11]. Proteins achieve their

    functional diversity through interactions and such interactions mediate overall organismal

    systems including metabolic pathways and cell-to-cell interactions [12]. Because of their

    dominant role in biological processes, protein interactions are normally responsible for

    healthy or diseased states in an organism. For example, the retinoblastoma (RB) protein is

    a tumor suppressor which prevents abnormal cell division by binding to the E2F

    transcription factor. When this interaction gets perturbed due to the absence or mutation of

    RB protein, E2F will be freely available for unrestrained cell division and formation of the

    tumor [13]. Therefore, understanding protein-protein interactions is crucial for understanding

    basic cellular biology, the functions of previously uncharacterized proteins, and disease

    mechanisms. Moreover, knowledge about protein interactions is also important in

    therapeutics for developing effective and personalized drugs with fewer side effects because

    more than 80% of current therapeutic targets are proteins [5].

    Protein-protein interactions (PPIs) are noncovalent physical connections

    established between amino acids in the 3-D structures of two or more proteins at

    specific locations called binding or interaction sites. Physiochemically complementary

    protein interactions are normally steered by hydrogen bonding, electrostatic forces, and

    hydrophobic effects [14]. Biologists are normally interested in solving the following three

    main challenging problems related to PPIs.

    a) Pairwise Protein Interactions: Whether two given proteins interact or not?

    b) Binding Affinity: Strength of the interaction.

    c) Interface or Interaction Site: Exact location of the interaction.

    In this work, we have developed machine learning based computational methods

    which would assist biologists in the wet lab for solving the aforementioned protein

    interaction related challenges.

    1.1 Motivations

    The field of biology experienced two important conceptual shifts in the 20th century with

    the discovery of Mendel’s laws and restriction enzymes [15], [16]. Meanwhile, the

    complete sequencing of the human genome in 2001 [17] and emerging high-throughput

    sequencing (HTS) technologies empowered biologists to think of complex biological

    questions in terms of molecules. Raw DNA and protein sequence data are growing at an

    exponential rate in databases such as GenBank [18] and the Universal Protein

    Knowledgebase (UniProt) [19]. The major challenge now is to analyze this large amount

    of data, as most of the sequenced genes and proteins are uncharacterized and of unknown

    function [20]. The task of analyzing data in proteomics is further complicated because

    of the involvement of complex protein interactions. This problem creates a wide scope for

    researchers in the field of computer science and bioinformatics to analyze data in-silico

    and assist biologists in solving interesting biological problems.

    Moreover, in interactomics, the study of protein interactions, experimental

    methods are often laborious, time-consuming and expensive, making it difficult to

    investigate all possible protein interactions within and across organisms. For instance, the

    bacterium Bacillus anthracis has 5,508 protein-encoding genes [21], which when paired

    with the 20,000 or so human genes [22], give more than 100 million possible protein

    interaction pairs to validate experimentally. It is not practical to verify all possible

    interaction pairs through wet-lab experiments. Therefore, there is a pressing need for

    computational approaches to support wet-lab methods by predicting and ranking probable

    PPIs. Such computational approaches can assist biologists in focusing on the most likely

    interactions.
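
    The size of the candidate space quoted above follows from simple multiplication, as the short Python sketch below shows. The gene counts are those given in the text. The `top_k` helper and the example scores are hypothetical and only illustrate how the ranked output of a predictor could be used to shortlist pairs for wet-lab validation.

```python
pathogen_genes = 5508    # B. anthracis protein-encoding genes (from the text)
human_genes = 20000      # approximate number of human genes (from the text)

# Every pathogen protein can in principle be paired with every host protein.
candidate_pairs = pathogen_genes * human_genes
print(f"{candidate_pairs:,} candidate pairs")


def top_k(scored_pairs, k):
    """Return the k highest-scoring (pair, score) tuples."""
    return sorted(scored_pairs, key=lambda item: item[1], reverse=True)[:k]


# Hypothetical predictor scores for four candidate pairs (illustration only).
scores = [
    (("P1", "H7"), 0.12),
    (("P1", "H3"), 0.91),
    (("P2", "H3"), 0.47),
    (("P2", "H9"), 0.78),
]
shortlist = top_k(scores, 2)
print(shortlist)
```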

    1.2 Problem Statement and Research Aims

    Among computational approaches, application of machine learning techniques to

    bioinformatics for the prediction of PPIs is a well-accepted idea. In machine learning based

    predictors of PPIs, models are normally built by using sequence and structure information

    of protein interactions which have been discovered through experimental methods.

    Because structural information is unavailable for most novel proteins, the practical

    use of predictors that rely on it is limited, and sequence information is often the only practical choice.

    However, predicting PPIs through machine learning using sequence data alone is a challenging

    problem because proteins actually interact in specific three-dimensional conformations.

    Moreover, significant conformational changes, motion and

    flexibility, alternate binding modes, the dependence of binding propensity on the binding

    partner, and uncertainties in the annotation of available scientific data make the

    prediction of PPIs hard. In every machine learning setting, it is vital to thoroughly

    understand the nature of the problem, availability and amount of training data and the

    prospective use of the system while designing the predictive model, its evaluation

    methodology, and the performance metrics. However, this requirement is even more

    crucial in bioinformatics in comparison to other application areas because of its role as a

    tool for biological discovery. Also, with the growth in proteome data of different

    organisms, there is a pressing need for PPIs predictors which can incorporate more and

    more information from different sources to gain better generalization. To achieve these

    goals, we have formulated and accomplished the following research aims in this study.

    We have performed a survey of existing machine learning based computational

    techniques for protein-protein interaction prediction in order to assess the suitability

    and limitations of existing evaluation protocols and performance metrics for this

    purpose.

    We have designed sequence-based machine learning models of PPIs for interaction,

    interaction site, and affinity prediction with improved generalization accuracy by

    incorporating learning data at the proteome level and combining information of

    interactions and interaction sites.

    Generally, computational techniques exploit protein sequence and 3-D structural

    information to develop predictive models for protein interactions. One of the major

    issues with techniques using protein 3-D structural information which limits their

    applicability is the unavailability of solved 3-D structures of novel proteins. Our

    aim in this study is therefore also to design a machine learning based model for PPI

    prediction that can use both protein structural and sequence information during

    training but requires only sequence information for testing.

    1.3 Dissertation Organization and Chapters’ Digest

    This dissertation is divided into the following chapters.

    In chapter 2 “Problem Formulation and Literature Survey”, we give the

    required background of proteins, their structure, functions, and interactions along with

    experimental procedures of determining these interactions. We also perform a literature

    survey of existing computational techniques of predicting protein-protein interactions

    (PPIs), interfaces/interaction sites, and protein binding affinity along with the formulation

    of these problems as machine learning problems.

    In chapter 3 “Issues in Host-Pathogen Protein Interaction Prediction”, we have

    performed a survey of existing machine learning based host-pathogen protein interactions

    (HPIs) prediction techniques. The objective of the survey was to assess the suitability and

    limitations of existing evaluation protocols and performance metrics to design predictive

    HPI models. In this chapter, we have investigated the usefulness of K-fold cross-validation

    for evaluating the generalization performance of pairwise protein interaction predictors in

    host-pathogen interactions (HPIs). K-fold cross-validation does not avoid redundancy

    between the training and test data and therefore yields inflated accuracy estimates. To control this data

    redundancy at pathogen protein level, we have proposed and shown the effectiveness of a

    new evaluation scheme called Leave One Pathogen Protein Out (LOPO) cross-validation.

    We have also proposed and suggested the use of some biologist-centric metrics for HPIs

    predictors. Our findings of this study have been published in the Journal of Bioinformatics

    and Computational Biology (JBCB), 14(3), 2016, 1650011.
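
    The idea behind LOPO cross-validation can be sketched in a few lines of Python. This is a minimal illustration based on the description above, not the evaluation code used in this work: it assumes each example is a (pathogen protein, host protein, label) tuple and holds out all examples of one pathogen protein at a time. The function name `lopo_splits` and the toy identifiers are hypothetical.

```python
from collections import defaultdict


def lopo_splits(pairs):
    """Yield (held_out_pathogen, train_idx, test_idx), one split per pathogen protein.

    pairs: list of (pathogen_protein_id, host_protein_id, label) tuples.
    """
    by_pathogen = defaultdict(list)
    for idx, (pathogen_id, _host_id, _label) in enumerate(pairs):
        by_pathogen[pathogen_id].append(idx)

    for pathogen_id, test_idx in by_pathogen.items():
        held_out = set(test_idx)
        # No example of the held-out pathogen protein may appear in training.
        train_idx = [i for i in range(len(pairs)) if i not in held_out]
        yield pathogen_id, train_idx, test_idx


# Toy interaction data (identifiers and labels are made up for illustration).
pairs = [
    ("P1", "H1", 1), ("P1", "H2", 0),
    ("P2", "H1", 1), ("P2", "H3", 0),
    ("P3", "H2", 1),
]
splits = list(lopo_splits(pairs))
for pathogen_id, train_idx, test_idx in splits:
    print(pathogen_id, "train:", train_idx, "test:", test_idx)
```

    In contrast, a random K-fold split of the same list could place examples of one pathogen protein in both the training and the test fold, which is the source of redundancy discussed above.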

    In chapter 4 “CaMELS: Calmodulin Interaction Learning System”, we present

    a machine learning based algorithm suite called CaMELS (CalModulin intEraction

    Learning System) for CaM interaction and interaction site prediction using sequence

    information alone. CaMELS models CaM interaction and interaction site prediction as two

    separate classification problems and gives state-of-the-art accuracy for both tasks. To

    predict CaM interaction, CaMELS uses traditional support vector machine (SVM) along

    with features extracted from the whole sequence of a protein instead of localized window

    level features. For CaM interaction site prediction, in contrast, CaMELS uses the

    Multiple Instance Learning (MIL) paradigm to handle imprecision in binding site

    annotations in the training data. To solve the multiple instance learning problem,

    CaMELS uses a custom-built algorithm based on stochastic sub-gradient optimization

    (SSGO) that allows faster and more effective learning. We have shown improved

    generalization performance of CaMELS using a variety of evaluation techniques including

    wet-lab experiments. Python code for training and evaluating CaMELS together with a

    webserver implementation is available at the URL:

    http://faculty.pieas.edu.pk/fayyaz/software.html#camels. We have published the outcomes

    of this work in Proteins: Structure, Function, and Bioinformatics, 85(9), 2017, 1724–1740.
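
    To make the combination of MIL and stochastic sub-gradient optimization more concrete, the following Python sketch trains a linear large-margin model on bags of instances. It is a generic textbook-style formulation and not the CaMELS solver: it assumes that the score of a bag is the score of its highest-scoring instance and minimizes a regularized bag-level hinge loss with a decaying step size. All names (`train_mil`, `best_instance`) and the synthetic two-dimensional data are assumptions made for this example.

```python
import random


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def best_instance(w, bag):
    """Index of the highest-scoring instance in a bag (first one on ties)."""
    return max(range(len(bag)), key=lambda i: dot(w, bag[i]))


def train_mil(bags, labels, dim, lam=0.01, epochs=50, seed=0):
    """Stochastic sub-gradient descent on lam/2*||w||^2 + bag-level hinge loss."""
    rng = random.Random(seed)
    w = [0.0] * dim
    t = 0
    for _ in range(epochs):
        order = list(range(len(bags)))
        rng.shuffle(order)
        for b in order:
            t += 1
            eta = 1.0 / (lam * t)                      # decaying step size
            x = bags[b][best_instance(w, bags[b])]     # instance that represents the bag
            violated = labels[b] * dot(w, x) < 1.0
            w = [(1.0 - eta * lam) * wi for wi in w]   # step on the regularizer
            if violated:                               # step on the hinge loss
                w = [wi + eta * labels[b] * xi for wi, xi in zip(w, x)]
    return w


# Synthetic bags: each bag stands for a protein, each instance for a sequence window.
bags = [
    [(1.0, 0.1), (-1.0, 0.2)],     # positive bag: first window is the binding one
    [(1.2, -0.3), (-0.8, 0.0)],    # positive bag
    [(-1.0, 0.3), (-1.1, -0.2)],   # negative bag: no binding window
    [(-0.9, -0.1), (-1.2, 0.4)],   # negative bag
]
labels = [1, 1, -1, -1]

w = train_mil(bags, labels, dim=2)
predicted = [1 if dot(w, bag[best_instance(w, bag)]) > 0 else -1 for bag in bags]
sites = [best_instance(w, bag) for bag in bags[:2]]
print(predicted, sites)
```

    Only bag-level labels are used during training, yet the highest-scoring instance of a positive bag gives a window-level prediction, which is why this paradigm tolerates imprecise binding site annotations.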

    In chapter 5 “ISLAND: In-Silico Protein Affinity Predictor”, sequence-based

    protein binding affinity prediction methods using machine learning have been explored.

    Specifically, we present our findings that the true generalization performance of even the

    state-of-the-art sequence-only predictor is far from satisfactory and that the development

    of machine learning methods for binding affinity prediction with improved generalization

    performance is still an open problem. We have also proposed a sequence-based novel

    protein binding affinity predictor called ISLAND which gives better accuracy than existing

    methods over the same validation set as well as on an external independent test dataset. A

    cloud-based webserver implementation of ISLAND and its Python code are available at

    http://faculty.pieas.edu.pk/fayyaz/software.html#island.

    In chapter 6 “Learning Using Privileged Information for Protein Binding

    Affinity Prediction”, we have developed a novel machine learning method for predicting

    binding affinity which is based on the framework of learning using privileged information

    (LUPI). This method uses protein 3D structure as privileged information at training time

    while expecting only protein sequence information during testing. The proposed method

    outperforms several baseline learners and a state-of-the-art binding affinity predictor not

    only in cross-validation but also on an additional validation dataset. This suggests that

    the LUPI framework implemented for this work may be useful in other areas of

    bioinformatics as well. A Python implementation of the proposed method together with a

    webserver is available at http://faculty.pieas.edu.pk/fayyaz/software.html#LUPI. The

    outcomes of this work have been published in BMC Bioinformatics, 19, 2018, 425.

    In chapter 7 “PAIRpred: A Webserver for Partner-Aware Protein Interface

    Prediction”, a web server has been developed and deployed for PAIRpred which is a state-

    of-the-art method for predicting partner-specific interface of a protein complex using either

    sequence information alone or in conjunction with features derived from the unbound

    structures of the two proteins in the complex. The web server is available at

    http://faculty.pieas.edu.pk/fayyaz/software.html#pairpred. This webserver takes a pair of

    proteins in FASTA or PDB format and produces downloadable predictions, with the

    predicted interface highlighted in its output PDB files.

    In chapter 8, “Conclusions and Future Work”, we have summarized the

    conclusions drawn from this study along with the details of the projects to be completed in

    the future.

    2 Problem Formulation and Literature Survey

    In this chapter, we start with a brief introduction of proteins and their characteristics to

    assist the reader with relevant biological background. Then, we discuss interesting

    biological problems in protein interactions, along with experimental and computational

    methods of solving these problems. Further, we formulate biological questions in

    protein-protein interaction as machine learning problems and perform a literature

    survey of known machine learning techniques in this domain to highlight important

    research questions.

    2.1 Proteins

    Proteins are the second most abundant macromolecules present in a cell [5]. Proteins

    are made up of smaller units called amino acids. There are twenty naturally occurring

    amino acids which are considered as the raw material of all proteins. An amino acid is

    an organic compound containing amine (-NH2) and carboxyl (-COOH) group together

    with a side chain functional group as shown in Fig. 2.1 [14]. Every amino acid has a

    unique functional group attached to it. Different amino acids have different

    physiochemical properties (see Fig. 2.1) and are encoded by codons in the

    Deoxyribonucleic acid (DNA) through a linear relationship [14]. These 20 amino acids

    are linked with each other through peptide bonds in various combinations to form

    proteins with diverse structures and functions. Proteins vary in length from a hundred

    to thousands of amino acids.

    Figure 2.1. The chemistry of an amino acid (left panel) and properties of the side

    chain (right panel). Every amino acid has a carbon atom, called the alpha carbon

    (Cα), bonded to a carboxylic acid (-COOH) group, an amine (-NH2) group, a

    hydrogen atom, and an R group (side chain) that is unique for every amino acid.

    The physicochemical properties of an amino acid are determined by the nature of its side

    chain.

  • Problem Formulation and Literature Survey

    9

    2.1.1 Protein Structures

    Protein structure is described at four levels: primary, secondary, tertiary, and quaternary.

    These levels of protein structure are shown in Fig. 2.2. Proteins are also called

    polypeptides: amino acids are joined together in various combinations

    through peptide bonds to form a linear chain called the primary structure of the

    protein. Each peptide bond is formed between the amino group of one amino acid and the

    carboxyl group of another, releasing one water molecule [14]. The primary structure of a protein

    is also called its amino acid sequence and is normally available in FASTA format. Some

    parts of protein sequences contain a biologically significant pattern called a motif. For

    example, the IQ calmodulin-binding motif has the following sequence pattern:

    [FILV]Qxxx[RK]Gxxx[RK]xx[FILVWY], where x represents any amino acid and

    square brackets enclose alternative amino acids.

    Figure 2.2. Different levels of protein structure. Amino acids are joined

    together in various combinations through covalent bonds to form the primary structure.

    Different sections of the primary structure fold through backbone hydrogen

    bonding to form alpha helices and beta sheets. Elements of secondary structure fold

    further through side chain interactions to form the tertiary structure, stabilized by ionic

    bonds, disulfide bonds, hydrophobic interactions, and hydrogen bonding. Protein

    quaternary structures are formed through interaction or binding of two or more

    independent tertiary structures.
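The motif notation used for the IQ calmodulin-binding motif maps almost directly onto a regular expression: bracketed sets are already character classes, and x becomes a wildcard. The following is a minimal sketch of such a scan, assuming one-letter amino acid codes; the function names and the example sequence are hypothetical and chosen only for illustration.

```python
import re

# IQ calmodulin-binding motif in the bracket/x notation used in the text.
IQ_MOTIF = "[FILV]Qxxx[RK]Gxxx[RK]xx[FILVWY]"


def motif_to_regex(pattern: str) -> str:
    """Translate the motif notation into a Python regular expression.

    Lowercase 'x' means "any amino acid" and becomes the regex wildcard '.';
    bracketed alternatives are already valid regex character classes.
    """
    return pattern.replace("x", ".")


def find_motif(sequence: str, pattern: str = IQ_MOTIF):
    """Return (start_index, matched_text) for each non-overlapping match."""
    regex = re.compile(motif_to_regex(pattern))
    return [(m.start(), m.group()) for m in regex.finditer(sequence.upper())]


if __name__ == "__main__":
    # Made-up sequence containing one motif instance (not a real protein).
    toy_sequence = "MAIQAAARGAAAKAAFGG"
    for start, hit in find_motif(toy_sequence):
        print(f"motif at index {start}: {hit}")
```

Note that `finditer` reports non-overlapping matches only; a lookahead-based pattern would be needed if overlapping occurrences matter.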

    The bonds that link the amino nitrogen to the alpha carbon and the alpha carbon

    to the carbonyl carbon are free to rotate, which allows various orientations of amino

    acids in the polypeptide chain. Rotations about these bonds are described by the 𝜙 and 𝜓

    torsion angles, respectively, and the allowed values of these angles are shown in the

    Ramachandran plot [23]. These allowed backbone rotations let a polypeptide

    fold into secondary structures such as alpha helices and beta sheets (see Fig. 2.2). These

    secondary structures are formed and stabilized by hydrogen bonding between the

    backbone atoms of the residues. An alpha helix is formed by hydrogen bonding between

    residues that are close together in the sequence, while beta sheets are formed through

    hydrogen bonding between residues that may be distant in the sequence [14].
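The backbone torsion angles can be computed from atomic coordinates. The following is a minimal sketch, assuming the usual convention that 𝜙 is defined by the atoms C(i-1), N(i), Cα(i), C(i) and 𝜓 by N(i), Cα(i), C(i), N(i+1), with coordinates supplied as 3-D points; the function names are illustrative and the sketch does not parse PDB files.

```python
import numpy as np


def dihedral(p0, p1, p2, p3):
    """Signed torsion angle in degrees about the p1-p2 bond."""
    p0, p1, p2, p3 = (np.asarray(p, dtype=float) for p in (p0, p1, p2, p3))
    b0 = p0 - p1  # from the central bond back to the first atom
    b1 = p2 - p1  # central bond
    b2 = p3 - p2  # from the central bond on to the last atom
    b1 = b1 / np.linalg.norm(b1)
    # Components of b0 and b2 perpendicular to the central bond.
    v = b0 - np.dot(b0, b1) * b1
    w = b2 - np.dot(b2, b1) * b1
    x = np.dot(v, w)
    y = np.dot(np.cross(b1, v), w)
    return float(np.degrees(np.arctan2(y, x)))


def phi_psi(c_prev, n, ca, c, n_next):
    """Return (phi, psi) in degrees for one residue.

    phi: C(i-1), N(i), CA(i), C(i);  psi: N(i), CA(i), C(i), N(i+1).
    """
    return dihedral(c_prev, n, ca, c), dihedral(n, ca, c, n_next)


if __name__ == "__main__":
    # Toy coordinates (not from a real structure): a quarter turn about the x axis.
    print(dihedral([0, 1, 0], [0, 0, 0], [1, 0, 0], [1, 0, 1]))
```

Plotting the (𝜙, 𝜓) pairs of all residues of a structure against each other gives the Ramachandran plot referred to above.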

    Different secondary structure elements join together and fold into the native 3-D

    tertiary structure of a protein, as shown in Fig. 2.2. In the tertiary structure, various interactive

    forces between side chain atoms of residues in the polypeptide play an important role

    [14]. These interactive forces include hydrogen bonding, ionic bonds, disulfide bonds,

    and hydrophobic interactions. The amino acid sequence of a stable protein contains

    enough information to fold into its native tertiary structure [24]. The 3-D structure of a

    protein determines its functions [14]. Protein 3-D structures are available as

    atomic coordinates of all residues in PDB format.

    Protein quaternary structures are formed when several tertiary structures join

    together (see Fig. 2.2). Protein quaternary structures result from protein interactions

    and are also called protein complexes. The formation of these quaternary structures

    involves the same kinds of interactive forces as tertiary structure formation.

    2.1.2 Protein Functions

    Proteins are involved in all the important biological processes and perform almost all

    tasks at the cellular level. Proteins are diverse in their functions and are responsible for

    cell shape, product manufacture, routine maintenance, waste cleanup, and inner

    organization. Proteins perform their roles as enzymes, antibodies, structural

    components, or messengers, as shown in Fig. 2.3 [25]–[28]. There are thousands of

    chemical reactions involved in different metabolic pathways within a cell. Proteins called

    enzymes catalyze these reactions, making them proceed up to millions of times faster

    [25]. For example, sucrase catalyzes the hydrolysis of sucrose. Antibodies are

    proteins which are used by the immune system of an organism to identify and neutralize

    foreign invaders such as viruses or pathogenic bacteria; T-cell receptors, for example, are

    proteins that recognize antigens in an antibody-like manner. Antibodies perform their function

    through interaction with an antigen present on the surface of the invading organism [26]. Proteins such as

    actin also provide structural support to the cell and enable it to dynamically remodel

    itself in response to internal or environmental stimuli [27]. Cells also communicate to

    coordinate and perform basic activities such as tissue repair and immunity, and messenger

    proteins carry these signals. Insulin is a

    protein which helps in glucose and lipid metabolism by activating a cascade of cellular

    processes through interaction with the insulin receptor tyrosine kinase (IR) [29]. Growth

    hormones are proteins which stimulate tissue repair through cell regeneration in humans

    [28].

    Most protein functions involve the interaction of two or more proteins to form

    a protein complex [14]. For example, enzymes must bind to their substrates to perform

    catalysis and structural proteins bind together in order to gain strength and toughness

    [14]. Similarly, antibodies and messenger proteins also perform their functions by

    interacting with other proteins. Therefore, the study of protein interactions is of utmost

    importance in biology to decipher the functions of proteins, to characterize different

    biological processes or pathways, to interpret disease mechanisms, and to design effective

    drugs. Proteins interact with other proteins and with macromolecules such as DNA and RNA,

    but in this dissertation we focus specifically on the study of protein-protein interactions (PPIs).

    In the next sections of this chapter, we discuss how proteins interact with other

    proteins, biologically significant problems in protein interactions, and how computational

    techniques can contribute to handling these problems.

    Figure 2.3. Protein functions. Proteins perform their functions as enzymes

    (Sucrase), antibodies (T-cell receptor), messengers (Insulin), or structural components

    (Actin). The most fundamental function that proteins perform, and which underpins all

    their other biochemical functions, is their ability to bind or interact with other proteins

    or macromolecules.

    2.2 Protein Interactions and Complex Formation

    Proteins generally do not function in isolation; they interact with each other to perform

    vital roles in various biological processes and metabolic pathways [11], [14]. More

    than 80% of all cellular proteins are involved in these types of interactions [10], [11].

    Protein-protein interactions (PPIs) are physical contacts between residues of two

    proteins (ligand and receptor) in a highly specific manner (see Fig. 2.4). These

    interactions happen in a specific biomolecular context and are normally driven by the

    same electrostatic forces and hydrophobic effects that are involved in protein

    folding [14], [30]. Complementarity in shape and in charge distribution on the surface of

    proteins are the two major factors that play a significant role in protein interactions

    [14]. During an interaction, proteins can also go through conformational changes,

    as described by the conformational selection model [14], [31]. These conformational

    changes enable optimized interaction and support formation of a stable complex.

    Protein complexes are formed through protein-protein interactions, as

    shown in Fig. 2.4. The set of all such interactions constitutes the interactome of an organism.

    Protein interactions can happen within the same organism (intra-species) and across

    different organisms (inter-species), such as host-pathogen protein interactions (HPIs).

    Studies of protein interactions from different perspectives, such as molecular dynamics,

    biochemistry, and signal transduction, create protein interaction networks. These protein

    interaction networks, like metabolic pathways, help biologists to gain a better

    understanding of underlying biological processes, to understand disease mechanisms,

    and to aid studies for the design, discovery, and effectiveness of therapeutic drugs [32].

    Figure 2.4. Protein interaction. Two unbound proteins (ligand and receptor) with

    complementarity in shape and charge distribution interact with each other to form a

    protein complex. The interface of the complex at a 6 Å distance threshold is shown with

    sticks in magenta.

    2.2.1 Binding Affinity of Interacting Proteins

    Binding affinity is a measure of the strength of interaction between proteins which bind

    reversibly in a protein complex [5], [7]. High binding affinity indicates tighter binding

    between proteins involved in an interaction. Experimentally, it can be measured in

    terms of the dissociation constant ($K_d = \frac{[L][R]}{[LR]}$), which is the ratio of the product

    of the concentrations of the free ligand and receptor proteins ($[L]$, $[R]$) to the concentration

    of the protein complex ($[LR]$) [7]. Smaller values of $K_d$ indicate higher binding affinity and vice

    versa. Thermodynamically, the formation of protein complexes through protein

    interactions also involves a loss in free energy [5]. A larger loss in free energy indicates higher

    binding affinity and results in a more stable protein complex. Therefore, binding

    affinity can also be measured by taking the difference between the free energy of the

    protein complex and the sum of the free energies of the unbound proteins. This difference is

    called the change in Gibbs free energy upon binding (∆G). Its value is usually

    small in magnitude, ranging from -2.5 to -22 kcal/mol.
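The two measures of binding affinity described above, the dissociation constant and the free energy change upon binding, are linked by a standard thermodynamic relation that is not stated explicitly in the text: the binding free energy equals RT ln(K_d / c°), where c° is a 1 M reference concentration. The following is a minimal sketch of the conversion, assuming K_d is given in mol/L and a temperature of 298.15 K; the function names are illustrative.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)


def kd_to_delta_g(kd_molar: float, temperature_k: float = 298.15) -> float:
    """Binding free energy in kcal/mol from K_d in mol/L (1 M standard state)."""
    if kd_molar <= 0:
        raise ValueError("K_d must be positive")
    return R_KCAL * temperature_k * math.log(kd_molar)


def delta_g_to_kd(delta_g_kcal: float, temperature_k: float = 298.15) -> float:
    """Inverse conversion: K_d in mol/L from a binding free energy in kcal/mol."""
    return math.exp(delta_g_kcal / (R_KCAL * temperature_k))


if __name__ == "__main__":
    for kd in (1e-6, 1e-9, 1e-12):  # micromolar, nanomolar, picomolar binders
        print(f"K_d = {kd:.0e} M  ->  {kd_to_delta_g(kd):.2f} kcal/mol")
```

A smaller K_d gives a more negative free energy change, which matches the statement that smaller K_d values indicate higher binding affinity.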

    2.2.2 Interfaces or Interaction Sites of Proteins

    When two proteins interact to form a protein complex, only a part of each protein is

    involved in binding, as shown in Fig. 2.4. This part on one protein is called the

    interaction site of the protein, whereas all the interacting residue pairs on both proteins

    constitute the interface of the protein complex (see Fig. 2.4). Therefore, in finding an

    interaction site, we are only interested in the residues of one protein in a complex which

    are involved in the interaction, without considering the residues of the other protein. In contrast,

    while determining the interface of the protein complex, we find all residue pairs on both

    proteins which are involved in the interaction. It is interesting to note that if the

    interface of a protein complex is known, the interaction sites of the

    interacting proteins in the complex can easily be extracted from it.

    If a solved 3-D structure of a protein complex is available, the interface of the complex

    can be extracted by taking all residue pairs of the interacting proteins whose alpha

    carbon atoms lie within a distance threshold, typically 6.0 to 8.0 Angstroms [14], [33]. This approach of

    extracting an interface from a protein complex is simple and has been used by

    many researchers in the field. However, residue pairs within this distance are not always

    guaranteed to be interacting [34].
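The distance-based interface extraction described above, and the derivation of per-protein interaction sites from the interface, can be sketched as follows. This is a minimal illustration, assuming the alpha carbon coordinates of each protein are already available as arrays of shape (number of residues, 3) in Angstroms; parsing of PDB files is omitted, and the function names and the 6.0 Å default are assumptions made for the example.

```python
import numpy as np


def interface_pairs(ca_ligand, ca_receptor, cutoff=6.0):
    """Return (ligand_index, receptor_index) pairs whose C-alpha atoms lie
    within `cutoff` Angstroms of each other."""
    lig = np.asarray(ca_ligand, dtype=float)
    rec = np.asarray(ca_receptor, dtype=float)
    # Pairwise Euclidean distances, shape (n_ligand, n_receptor).
    dist = np.linalg.norm(lig[:, None, :] - rec[None, :, :], axis=-1)
    i, j = np.nonzero(dist <= cutoff)
    return list(zip(i.tolist(), j.tolist()))


def interaction_sites(pairs):
    """Collapse interface pairs into the interaction site of each protein."""
    ligand_site = sorted({i for i, _ in pairs})
    receptor_site = sorted({j for _, j in pairs})
    return ligand_site, receptor_site


if __name__ == "__main__":
    # Toy coordinates (not from a real complex).
    ligand = [[0.0, 0.0, 0.0], [20.0, 0.0, 0.0]]
    receptor = [[5.0, 0.0, 0.0], [0.0, 30.0, 0.0], [0.0, 0.0, 3.0]]
    pairs = interface_pairs(ligand, receptor)
    print(pairs)
    print(interaction_sites(pairs))
```

The full distance matrix is convenient for small complexes; for very large structures a spatial index such as a k-d tree would avoid computing every pairwise distance.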

    2.2.3 Types of Protein Interactions and Complexes

    Protein-protein interactions (PPIs) and the resulting protein complexes can be

    differentiated based on the permanence of these complexes and the number of different

    protein chains that are involved in the interaction. Protein complexes can be homomeric

    or heteromeric, as shown in Fig. 2.5 [35], [36]. Homomeric protein complexes are

    formed through the interaction of a single type of protein chain, and these complexes

    are called dimers, trimers, and so on, based on the number of chains involved in

    complex formation. Most transcription regulatory factors and scaffolding

    proteins perform their functions as homomers [36]. In heteromeric protein complex

    formation, distinct protein chains are involved in the interaction. In cell signaling,

    heteromeric protein complexes are involved in the biochemical cascade [36].

    Figure 2.5. Types of protein interactions and complexes. Protein complexes are

    homomeric if a single type of protein chain is involved in the interaction; if

    several types of protein chains are involved in complex formation, the

    complexes are called heteromeric. Protein complexes are further divided into stable

    or transient based on the duration of the interaction. Binding affinity is a measure of the

    strength of interaction between the proteins involved in complex formation. Binding

    affinity is measured in terms of the dissociation constant ($K_d$) and is high for low $K_d$ values. Stable complexes have high binding affinity, whereas weak transient complexes have low binding affinity.


    Protein interactions can be classified as stable or transient based on their

    interaction duration (see Fig. 2.5) [37]. Stable protein interactions are those

    that persist for a long time and form permanent complexes for different

    molecular roles [38]. Stable interactions are involved in most homomeric and in some

    heteromeric complexes. Core RNA polymerase and hemoglobin are examples of

    stable complexes. In contrast, transient protein interactions occur reversibly for a short

    duration in a specific molecular context [38]. For example, most protein interactions in

    cell signaling are transient. Transient interactions control most cellular functions such

    as protein folding, protein modification, and cell cycling. Folding and binding are

    inseparable in the case of stable complexes whereas, in transient complexes, protein

    folding and binding are two separate events.

    In this dissertation, we generally focus on heteromeric transient protein

    complexes regardless of their functions.

    2.2.4 Biologically Significant Effects of Protein Interactions

    Protein interactions normally take place in a specific molecular context. Interacting

    proteins have certain underlying functional objectives which are expressed in various

    ways. Some of the measurable, biologically significant effects of protein interactions are

    listed as follows [39], [40].

    • Activation or deactivation of a protein.

    • Changing the interaction behavior of a protein by altering its binding specificity

      towards different binding partners.

    • Regulation of cellular functionality through participation in either upstream or

      downstream events.

    • Creation of a new binding mode in a protein.

    2.2.5 Problems of Interest in Protein Interactions

    Biologists and pharmacologists have various objectives in studying protein-protein

    interactions. Some of them are listed as follows.

    • To get an idea of the function and behavior of proteins.

    • To determine the biological process or pathway in which a protein of unknown

      function is involved.

    • To determine different binding modes of a protein.

    • To determine the specificity of a protein towards multiple targets.

    • To discover, design, and measure the effectiveness of drugs and therapeutic

      agents.

    • To combat infectious diseases.

    • To promote or inhibit protein interactions.

    • To design new proteins.

    To meet all the above objectives, biologists and drug designers are generally

    interested in solving the following three related problems in protein interactions.

    i) Protein Interaction: Whether two given proteins interact or not?

    ii) Binding Affinity: What is the strength of their interaction?

    iii) Interface or Interaction Site: What is the exact location of interaction?

    We perform a literature survey of existing experimental and computational

    methods of solving these problems in the following sections.

    2.3 Experimental Methods

    Several experimental methods have been developed to determine protein interaction,

    binding affinity, and interface or interaction site as shown in Fig. 2.6. These

    experimental procedures are performed in vivo (within an organism) or in vitro (outside

    an organism). The problem of determining whether two given proteins interact can be

    treated as a binary classification problem. Experimental methods of determining protein-

    protein interactions are classified as small-scale or high-throughput methods [41], [42].

    Small-scale methods such as co-immunoprecipitation [43] and Surface Plasmon

    Resonance [44] are often used to detect one interaction at a time. High-throughput

    methods such as Yeast Two-Hybrid (Y2H) [45] and Tandem Affinity Purification

    (TAP) [46] are used to get thousands of interactions at a time. Binding affinity is the

    measure of the strength of interaction between two proteins. Experimental methods

    such as Isothermal Titration Calorimetry (ITC) [47], Surface Plasmon Resonance

    (SPR) [48], and Fluorescence Polarization (FP) [49] can be used to determine protein

    binding affinity. The interface or binding site is the region of the proteins that is involved in

    the interaction. To determine the interface or binding site, there also exist

    experimental procedures such as X-ray crystallography [50], Nuclear Magnetic

    Resonance (NMR) [51], and biological assays such as site-directed

    mutagenesis [52]. A detailed discussion of these experimental techniques is outside the

    scope of this dissertation, as the primary focus of this study is on computational

    techniques. Interested readers are referred to [43]–[52] for further details. Here, we

    briefly outline some shortcomings of these experimental techniques.

    Experimental techniques can accurately determine protein-protein interactions

    (PPIs), but these techniques are expensive and time-consuming [39], [53], [54].

    Meanwhile, high-throughput methods produce many false positives and false negatives.

    Moreover, these methods are difficult to reproduce and have limited coverage [41].

    Furthermore, experimental methods depend on laboratory protocols and experimental

    conditions which make it difficult to have an unbiased comparison across different

    studies. Due to these shortcomings in experimental techniques, accurate computational

    methods for protein interaction, binding affinity, and interface prediction are required.

    2.4 Computational Methods

    The cost and time constraints of experimental methods make them infeasible for

    large-scale application at the interactome level of an organism. Therefore, there is high

    demand for accurate computational approaches to support wet-lab methods by

    predicting and ranking probable PPIs. Such computational approaches can assist

    biologists in focusing on the most likely interactions [55]. Several computational

    techniques exist in the literature for protein-protein interaction problems. These

    computational techniques can roughly be categorized into classical and machine

    learning based methods. In this study, we focus on machine learning based methods;

    classical methods are not within the scope of this dissertation. However, we give

    a brief overview of these classical computational techniques in the next section to show

    their limitations and to highlight the importance of machine learning based techniques

    in solving problems related to protein interactions.

    Figure 2.6. Experimental methods to determine protein interactions, binding

    affinity, and interaction site or interface.


    2.4.1 Classical Computational Methods

    A number of computational methods, other than machine learning based techniques,

    have been developed to determine protein interaction, binding affinity, and interface or

    interaction site of proteins in a protein complex as shown in Fig. 2.7. These methods

    have been grouped as homology-based (Interolog Search, Phylogenetic Similarity, and

    Template-Based), simulation-based (Molecular Dynamics Simulation), and others (Text

    Mining, Network Topology Based, Docking, Energy Perturbation, and Empirical

    Scoring). A detailed discussion of these methods is not within the scope of this study

    as our primary focus is on machine learning techniques. However, interested readers

    are referred to [7], [10], [56]–[58] for further study. Here, we provide a brief overview

    of these techniques along with their inherent limitations.

    Figure 2.7. Classical computational methods to predict protein interactions,

    binding affinity, and interaction site or interface of a protein complex.

    Homology-Based Methods: Homology-based methods rest on the basic

    assumption that protein interactions are conserved among different organisms. In Interolog

    search, protein-protein interactions are predicted based on the homology of proteins

    across different organisms, as shown in Fig. 2.8(a) [59]–[64]. Methods such as Molecular

    Interaction Search Tool (MIST) and BIP-BIANA (Biologic Interactions and Network

    Analysis) have been proposed and made accessible through their web servers for PPI

    prediction through Interolog search [61], [65]. A similar approach followed in

    homology-based methods is a phylogenetic

