Data Mining Protein Structures' Topological Properties to Enhance Contact Map Predictions


Dr. Jaume Bacardit

School of Computer Science and School of Biosciences

University of Nottingham


Weizmann Institute of Sciences, May 27th, 2010

Preface

The general context of the talk is Protein Structure Prediction (PSP).

Specifically, this talk describes our Contact Map (CM) prediction method, which was one of the top predictors in the last edition of CASP.

CASP = Critical Assessment of Techniques for Protein Structure Prediction, a biennial community-wide experiment to assess the state of the art in PSP.

The use of topological models of protein structure has contributed to better CM prediction.

Roadmap

Protein Structure Prediction (PSP)

Topological properties of protein residues (TP)

Our contact map predictor (CM)

Contact Map Prediction at CASP8 (CASP)

What insight can we extract from the method? (INS)

PSP TP CM CASP INS

PROTEIN STRUCTURE AND CONTACT MAP PREDICTION

Protein Structure Prediction

Protein Structure Prediction (PSP) aims to predict the 3D structure of a protein based on its primary sequence.

[Figure: a primary sequence and its corresponding 3D structure.]

Why PSP?

PSP remains, after many years, one of the main challenges in computational biology.

The function of a protein is determined by its structure.

Thus, algorithms for predicting a protein’s structure will aid:

Understanding a protein’s function and characterising its binding sites

Producing antibodies for immunolocalisation

And, looking far beyond, designing new proteins (better crops, more efficient drugs, etc.)

PSP: A family of problems

There are several kinds of prediction problems within the scope of PSP.

The main one is to predict the 3D coordinates of all atoms of a protein (or at least the backbone) based on its primary sequence.

There are many structural properties of individual residues within a protein that can be predicted, such as secondary structure (SS) and solvent accessibility (SA).

Accurate predictions of these sub-problems are a stepping stone towards the general 3D problem.

TOPOLOGICAL PROPERTIES OF PROTEINS

Contact Map

Two residues of a chain are said to be in contact if their distance is less than a certain threshold.

The contacts of a protein can be represented by a binary matrix: 1 = contact, 0 = non-contact.

Plotting this matrix reveals many characteristics of the protein structure.

CM prediction is used in many 3D PSP methods (e.g. I-TASSER).

[Figure: example contact map, with the contact patterns of helices and sheets marked.]
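The contact definition above can be sketched in a few lines of Python. This is only an illustration: it assumes one representative 3D coordinate per residue and an 8 Å cut-off, neither of which is stated in the talk (which only says "a certain threshold"). The function name `contact_map` and the toy coordinates are hypothetical.

```python
import math

def contact_map(coords, threshold=8.0):
    """Return a binary contact matrix for a list of (x, y, z) residue coordinates.

    threshold is an assumed cut-off in angstroms, not a value taken from the talk.
    """
    n = len(coords)
    cm = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) < threshold:
                cm[i][j] = cm[j][i] = 1  # the matrix is symmetric
    return cm

# Toy chain of four residues with made-up coordinates.
toy_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (20.0, 0.0, 0.0)]
toy_map = contact_map(toy_coords)
```

The diagonal is left at 0 here, which is a choice made for this sketch: a residue is not counted as being in contact with itself.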

Recursive Convex Hull

A structural feature that we proposed recently [Stout, Bacardit, Hirst & Krasnogor, Bioinformatics 2008; 24(7):916-923].

We model a protein as a series of nested layers, assigning each residue to one of the layers.

Strictly speaking, each layer is a convex hull of points.

The convex hull of a point set is simple and fast to compute.

The Recursive Convex Hull is computed by iteratively identifying the layers (hulls) of a protein.
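The iterative peeling of hulls can be illustrated with a small toy. The sketch below works on 2D points with a textbook monotone-chain hull, purely to keep the example self-contained; the method in the talk operates on 3D residue coordinates, and its exact conventions (for example, how points lying on a hull face are treated) are not given here. All function names are hypothetical.

```python
def cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def hull_vertices(points, indices):
    """Indices of the convex hull vertices of points[i] for i in indices (2D monotone chain)."""
    order = sorted(indices, key=lambda i: points[i])
    if len(order) <= 2:
        return set(order)

    def half_hull(seq):
        chain = []
        for i in seq:
            # Drop the last point while it would make a clockwise or straight turn.
            while len(chain) >= 2 and cross(points[chain[-2]], points[chain[-1]], points[i]) <= 0:
                chain.pop()
            chain.append(i)
        return chain

    return set(half_hull(order)) | set(half_hull(reversed(order)))

def recursive_convex_hull(points):
    """Assign each point a layer number: 1 for the outermost hull, 2 for the next, and so on."""
    layer = [0] * len(points)
    remaining = set(range(len(points)))
    depth = 0
    while remaining:
        depth += 1
        hull = hull_vertices(points, remaining)
        for i in hull:
            layer[i] = depth
        remaining -= hull  # peel this layer off and repeat on the interior points
    return layer

# Two nested squares around a centre point (made-up data).
toy_points = [(0, 0), (4, 0), (4, 4), (0, 4), (1, 1), (3, 1), (3, 3), (1, 3), (2, 2)]
toy_layers = recursive_convex_hull(toy_points)
```

In this toy, points that lie on a hull edge without being a vertex fall into the next layer; that is a property of this particular sketch, not a statement about the published method.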

Relation of RCH to other structural properties

Comparing:

Solvent Accessibility

Exposure [Ben-Shimon and Eisenstein; 05]

Residue depth [Chakravarti and Varadarajan; 99]

RCH/RCHr

Correlation between features

Proximity Graphs (PGs): DT ⊇ GG ⊇ RNG ⊇ MST

Poupon: 2004

Delaunay Tessellation (DT) of a point set

QHull: Barber, C.B., Dobkin, D.P., and Huhdanpaa, H.T., "The Quickhull algorithm for convex hulls," ACM Trans. on Mathematical Software, 22(4):469-483, Dec 1996

Proximity Graphs (PGs): DT ⊇ GG ⊇ RNG ⊇ MST

Gabriel Graph (GG): remove edges from DT if the sphere whose diameter is the edge between the two vertices contains another vertex.

Relative Neighbourhood Graph (RNG): remove edges from GG if the spherical lune of the two vertices contains another vertex.

Minimum Spanning Tree (MST): search the RNG for the spanning tree of minimum total edge length.
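The inclusion chain GG ⊇ RNG ⊇ MST can be checked on a small point set. The sketch below is a brute-force illustration that applies the geometric definitions to every pair of points, rather than pruning a Delaunay tessellation as the slides describe; it is cubic in the number of points and only meant for toy inputs. Function names and the example points are hypothetical.

```python
from itertools import combinations

def sq_dist(p, q):
    """Squared Euclidean distance (works in 2D or 3D)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def gabriel_graph(points):
    """Keep edge (i, j) unless another point lies strictly inside the sphere with diameter ij."""
    n = len(points)
    return {
        (i, j) for i, j in combinations(range(n), 2)
        if not any(
            sq_dist(points[i], points[k]) + sq_dist(points[j], points[k]) < sq_dist(points[i], points[j])
            for k in range(n) if k not in (i, j)
        )
    }

def relative_neighbourhood_graph(points):
    """Keep edge (i, j) unless another point is strictly closer to both i and j (inside the lune)."""
    n = len(points)
    return {
        (i, j) for i, j in combinations(range(n), 2)
        if not any(
            max(sq_dist(points[i], points[k]), sq_dist(points[j], points[k])) < sq_dist(points[i], points[j])
            for k in range(n) if k not in (i, j)
        )
    }

def minimum_spanning_tree(points, edges):
    """Kruskal's algorithm restricted to the given candidate edges."""
    parent = list(range(len(points)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    tree = set()
    for i, j in sorted(edges, key=lambda e: sq_dist(points[e[0]], points[e[1]])):
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j
            tree.add((i, j))
    return tree

toy_points = [(0.0, 0.0), (2.0, 0.0), (1.0, 1.5), (3.0, 2.0), (0.5, 3.0)]
gg = gabriel_graph(toy_points)
rng = relative_neighbourhood_graph(toy_points)
mst = minimum_spanning_tree(toy_points, rng)
```

Because the RNG always contains a Euclidean minimum spanning tree, running Kruskal on the RNG edges alone still yields a tree that connects every point.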

Residue Packing Density

Protein 153L

[Figure: proximity graphs and contact map of protein 153L.]

Public calculation server: http://lobelia.cs.nott.ac.uk/psp/newInterface/

Predictability of RCH

Using a variety of Machine Learning methods

Is RCH more predictable than other features?

[Figure: predictability of RCHr, RCH, RD (residue depth), Exp (exposure) and SA.]

But is it useful?

Using these predictions to help predict better CN

RCH and SA are the most useful predictors

OUR CONTACT MAP PREDICTION METHOD

Steps

1. Prediction of secondary structure (using PSIPRED), Solvent Accessibility, Recursive Convex Hull and Coordination Number

2. Integration of all these predictions plus other sources of information

3. Final CM prediction (using BioHEL)

Using BioHEL [Bacardit et al., 09]

Prediction of RCH, SA and CN

We selected a set of 2811 protein chains from PDB-REPRDB with:

A resolution of less than 2Å

Less than 30% sequence identity

No chain breaks or non-standard residues

90% of this set was used for training (~490000 residues)

10% was used for testing

How are these features predicted?

Many of these features are due to local interactions of an amino acid and its immediate neighbours.

We predict them from the closest neighbours in the chain.

[Figure: a chain of residues Ri-5 ... Ri+5 with their states SSi-5 ... SSi+5. A sliding window maps (Ri-1, Ri, Ri+1) to SSi, (Ri, Ri+1, Ri+2) to SSi+1, and (Ri+1, Ri+2, Ri+3) to SSi+2.]

Prediction of RCH, SA and CN

All three features were predicted based on a window of ±4 residues around the target.

Evolutionary information (as a Position-Specific Scoring Matrix) is the basis of this local information.

Each residue is characterised by a vector of 180 values.

The domain for all three features was partitioned into 5 states.
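A windowed input vector of this kind could be assembled roughly as follows. The sketch assumes a PSSM with 20 values per residue, so that 9 window positions times 20 columns gives the 180 values quoted above; the slide does not spell out this breakdown, so treat it as an inference. Padding with zeros beyond the chain ends is also an assumption, and `window_features` is a hypothetical name.

```python
def window_features(pssm, i, half_width=4):
    """Concatenate the PSSM rows of residues i-half_width .. i+half_width into one flat vector.

    pssm: list of per-residue rows (assumed here to hold 20 scores each).
    Positions that fall outside the chain are filled with zeros (an assumption of this sketch).
    """
    n_cols = len(pssm[0])
    features = []
    for pos in range(i - half_width, i + half_width + 1):
        if 0 <= pos < len(pssm):
            features.extend(pssm[pos])
        else:
            features.extend([0.0] * n_cols)
    return features

# Made-up PSSM for a chain of 12 residues: row r holds the value r + 1 in all 20 columns.
toy_pssm = [[float(r + 1)] * 20 for r in range(12)]
centre_vector = window_features(toy_pssm, 5)
first_vector = window_features(toy_pssm, 0)
```

Each such vector would then be paired with the discretised target (one of the 5 states) for the residue at the centre of the window.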

Characterisation of the contact map problem

Three types of input information were used:

1. Detailed information of three different windows of residues, centred around the two target residues (2x) and the middle point between them

2. Information about the connecting segment between the two target residues

3. Global protein information

[Figure: the parts of the chain covered by input types 1, 2 and 3.]

Contact Map dataset

The set of 2811 proteins was randomly halved.

Moreover, all proteins with more than 350 amino acids were discarded.

Still, the resulting training set contained more than 15.2 million instances and 631 attributes.

Less than 2% of those are actual contacts

36GB of disk space

Samples and ensembles

50 samples of 300K examples are generated from the training set with a ratio of 2:1 non-contacts/contacts.

BioHEL is run 25 times for each sample.

Prediction is done by a consensus of 1250 rule sets.

Confidence of prediction is computed based on the distribution of votes in the ensemble.

The whole training process takes about 289 CPU days (~5.5h/rule set).

[Figure: training set, split into 50 samples; 25 rule sets learned per sample; consensus of all rule sets gives the predictions.]
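The sampling and voting scheme can be sketched as below. This is an illustration under stated assumptions, not the actual training pipeline: sampling without replacement, a simple majority vote, and a confidence equal to the fraction of votes for the winning class are all choices made for this example. The function names are hypothetical.

```python
import random

def balanced_sample(instances, size, ratio=2, rng=None):
    """Draw `size` instances with `ratio` non-contacts for every contact.

    instances: list of (features, label) pairs, with label 1 = contact and 0 = non-contact.
    """
    rng = rng or random.Random()
    contacts = [x for x in instances if x[1] == 1]
    non_contacts = [x for x in instances if x[1] == 0]
    n_contacts = size // (ratio + 1)
    n_non_contacts = size - n_contacts
    sample = rng.sample(contacts, n_contacts) + rng.sample(non_contacts, n_non_contacts)
    rng.shuffle(sample)
    return sample

def consensus(votes):
    """Majority vote over the ensemble; votes holds one 0/1 prediction per rule set.

    Returns (label, confidence), where confidence is the share of votes for the winning label.
    """
    contact_fraction = sum(votes) / len(votes)
    if contact_fraction > 0.5:
        return 1, contact_fraction
    return 0, 1.0 - contact_fraction

N_SAMPLES = 50
RUNS_PER_SAMPLE = 25
N_RULE_SETS = N_SAMPLES * RUNS_PER_SAMPLE  # 50 x 25 = 1250 rule sets in the ensemble
```

With the figures on the slide, each 300K-example sample would hold 100K contacts and 200K non-contacts, and 289 CPU days spread over 1250 rule sets comes to roughly 5.5 hours each.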

CONTACT MAP PREDICTION AT CASP8

Contact Map prediction in CASP

Contact Map prediction is assessed using the 11 CASP targets in the Free Modelling category.

Also, only long-range contacts (with a minimum chain separation of 24 residues) are evaluated.

Predictor groups are asked to submit a list of predicted contacts and a confidence level for each prediction.

The assessors then rank the predictions for each protein and look at the top L/x ones, where L is the length of the protein and x={5,10}.

Contact Map prediction in CASP

From these L/x top-ranked contacts two measures are computed:

Accuracy: TP/(TP+FP)

Xd: difference between the distribution of predicted distances and a random distribution

22 groups participated in CASP8, but not all of them sent enough predictions for L/10 or L/5.
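The accuracy measure on the top L/x predictions can be written down directly from the description above. The sketch below is a simplified reading of the procedure, not the assessors' code: it filters to pairs at least 24 residues apart, ranks by confidence, keeps the top L/x (using integer division, which is an assumption), and reports TP/(TP+FP). The Xd score is not included because its exact formula is not given in the talk. `top_lx_accuracy` is a hypothetical name.

```python
def top_lx_accuracy(predictions, true_contacts, length, x, min_separation=24):
    """Accuracy TP/(TP+FP) over the top length/x long-range predictions.

    predictions: list of (i, j, confidence) residue pairs.
    true_contacts: set of (i, j) pairs with i < j that are real contacts.
    """
    long_range = [(i, j, c) for i, j, c in predictions if abs(i - j) >= min_separation]
    long_range.sort(key=lambda p: p[2], reverse=True)
    top = long_range[: max(1, length // x)]
    if not top:
        return 0.0
    tp = sum(1 for i, j, _ in top if (min(i, j), max(i, j)) in true_contacts)
    fp = len(top) - tp
    return tp / (tp + fp)

# Made-up predictions for a protein of length 50.
toy_predictions = [
    (0, 30, 0.9), (1, 40, 0.8), (2, 5, 0.99), (3, 45, 0.7),
    (4, 49, 0.6), (5, 35, 0.5), (6, 31, 0.4),
]
toy_true = {(0, 30), (3, 45), (5, 35), (6, 31)}
toy_accuracy = top_lx_accuracy(toy_predictions, toy_true, length=50, x=10)
```

In the toy data the pair (2, 5) has the highest confidence but is ignored because it is not long-range; three of the five remaining top-ranked pairs are true contacts.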

Accuracy Results

Accuracy for groups that predicted a common subset of targets

Ezkurdia et al. Proteins 2009; 77(Suppl 9):196-209

Xd results

L/10 prediction for target T0443-D1

67% accuracy


WHAT INSIGHT CAN WE EXTRACT FROM THE METHOD?

Is all that information useful?

Many different types of information were used to perform the prediction.

Is all of it relevant?

As BioHEL generates human-readable sets of rules, we can address this question.

Rule generated by BioHEL

Att PredSS_r1_1 is E,X and
Att PredRCH_r1 is 4 and
Att PredCN_r1_-1 is 0,2,3,4,X and
Att PredCN_r2_1 is 3,4 and
Att AA_freq_central_P=0 and
Att AA_freq_global_E is [0.02,0.10] and
Att PSSM_r2_-1_Y is [-7,9.69] and
Att PSSM_r2_0_I is [1.76,8]
then contact

8 attributes in this rule out of 631 (on average 8.3 att/rule)
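To make the semantics of such a rule concrete, here is one way it could be evaluated against an instance. The dictionary encoding below is a hypothetical representation chosen for this sketch and is not BioHEL's internal format: nominal attributes map to a set of allowed values, real-valued attributes to a closed interval, and `AA_freq_central_P=0` is read as the degenerate interval [0, 0]. Attributes not mentioned in the rule are simply ignored.

```python
# Hypothetical encoding of the rule shown above.
example_rule = {
    "PredSS_r1_1": {"E", "X"},
    "PredRCH_r1": {"4"},
    "PredCN_r1_-1": {"0", "2", "3", "4", "X"},
    "PredCN_r2_1": {"3", "4"},
    "AA_freq_central_P": (0.0, 0.0),
    "AA_freq_global_E": (0.02, 0.10),
    "PSSM_r2_-1_Y": (-7.0, 9.69),
    "PSSM_r2_0_I": (1.76, 8.0),
}

def rule_matches(rule, instance):
    """True if every predicate of the rule holds for the instance (a conjunction)."""
    for attribute, condition in rule.items():
        value = instance[attribute]
        if isinstance(condition, tuple):
            low, high = condition
            if not low <= value <= high:
                return False
        elif value not in condition:
            return False
    return True

# Made-up instance that satisfies all eight predicates.
example_instance = {
    "PredSS_r1_1": "E",
    "PredRCH_r1": "4",
    "PredCN_r1_-1": "2",
    "PredCN_r2_1": "3",
    "AA_freq_central_P": 0.0,
    "AA_freq_global_E": 0.05,
    "PSSM_r2_-1_Y": 1.0,
    "PSSM_r2_0_I": 4.0,
}
is_contact = rule_matches(example_rule, example_instance)
```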

Understanding the rule sets

Each rule set has on average 135 rules.

We have a total of 168470 rules.

It is impossible to read all of them individually, but we can extract useful statistics.

For instance, how often was each attribute used in the rules?
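Counting attribute usage across an ensemble of rule sets is straightforward once each rule is reduced to the list of attributes it tests. The sketch below assumes that reduced representation (a hypothetical one) and normalises by the total number of attribute occurrences; how the percentages in the talk were normalised is not stated, so that choice is an assumption.

```python
from collections import Counter

def attribute_usage(rule_sets):
    """Count how many rules use each attribute, across all rule sets.

    rule_sets: list of rule sets; each rule is given as the collection of attribute names it tests.
    Returns {attribute: (count, share of all attribute occurrences)}.
    """
    counts = Counter()
    for rule_set in rule_sets:
        for rule in rule_set:
            counts.update(set(rule))  # count an attribute once per rule
    total = sum(counts.values())
    return {attribute: (n, n / total) for attribute, n in counts.items()}

# Two made-up rule sets with three rules in total.
toy_rule_sets = [
    [["separation", "PredSA_r1"], ["separation"]],
    [["PredSA_r1", "PredSS_r1", "separation"]],
]
toy_usage = attribute_usage(toy_rule_sets)
```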

Distribution of frequency of use of attributes

All 631 attributes are actually used (min frequency=429)

However, some of them are used much more frequently than others

Top 10 attributes

Attribute      Frequency   Counts
PredSS_r1_1    1.48%       18141
PredCN_r1      1.66%       20336
propensity     1.74%       21288
PredSS_r2      1.75%       21350
PredSS_r1      1.82%       22205
PredRCH_r2     1.87%       22856
PredRCH_r1     2.04%       24961
PredSA_r2      2.12%       25891
PredSA_r1      2.39%       29246
separation     4.17%       50951

All four kinds of residue-level predictions are highly ranked

Beyond individual attributes…

We can also identify when certain pairs (or triplets) of attributes always appear together in rules.

Rules for alpha helices or beta sheets

And we can look not just at the attributes, but also at the actual patterns of predicates.
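Finding attribute pairs that always appear together can be sketched as a co-occurrence count. As before, each rule is reduced to the list of attribute names it tests, which is a hypothetical representation for this example; "always together" is interpreted here as neither attribute ever occurring in a rule without the other, which is one possible reading of the slide.

```python
from collections import Counter
from itertools import combinations

def pair_counts(rules):
    """Count, for every pair of attributes, the number of rules that use both."""
    pairs = Counter()
    for rule in rules:
        pairs.update(combinations(sorted(set(rule)), 2))
    return pairs

def always_together(rules):
    """Pairs of attributes where neither one ever appears in a rule without the other."""
    singles = Counter(attribute for rule in rules for attribute in set(rule))
    return [
        pair for pair, n in pair_counts(rules).items()
        if n == singles[pair[0]] == singles[pair[1]]
    ]

# Made-up rules over attributes a, b, c and d.
toy_rules = [["a", "b", "c"], ["a", "b"], ["c", "d"]]
toy_pairs = always_together(toy_rules)
```

Extending `combinations(..., 2)` to `combinations(..., 3)` would give the same statistic for triplets.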

Conclusions

Our method was one of the top-performing CM predictors in CASP8.

It combines novel topological features (RCH) with a robust data mining method.

Our BioHEL rule-based data mining method is able to:

Generate competent predictions

Extract explanations from the predictions

There is still a lot of room for improvement:

Better ranking of predictions

Alternative formulation of sub-predictions

Correlated mutations

CM prediction. Is it worth it?

CM predictors (blue) vs contacts derived from 3D PSP methods (orange)

In CASP8, for the first time, the CM methods were competent.

Acknowledgements

Many thanks to the members of our Infobiotics team in CASP8: Prof. Natalio Krasnogor, Prof. Jonathan Hirst and Dr. Michael Stout

The UK Engineering and Physical Sciences Research Council (EPSRC), under grant GR/T07534/01

The University of Nottingham’s High Performance Computing cluster
