Date post: | 14-Jun-2015 |
Upload: | jaumebp |
Data Mining Protein Structures' Topological Properties to Enhance Contact Map Predictions
Dr. Jaume Bacardit
School of Computer Science and School of Biosciences
University of Nottingham
Weizmann Institute of Sciences, May 27th, 2010
Preface
The general context of this talk is Protein Structure Prediction (PSP)
Specifically, this talk describes our Contact Map (CM) prediction method, which was one of the top predictors in the last edition of CASP
CASP = Critical Assessment of Techniques for Protein Structure Prediction, a biennial community-wide experiment to assess the state of the art in PSP
The use of topological models of protein structure has contributed to better CM prediction
Roadmap
Protein Structure Prediction (PSP)
Topological properties of protein residues (TP)
Our contact map predictor (CM)
Contact Map Prediction at CASP8 (CASP)
What insight can we extract from the method? (INS)
PSP TP CM CASP INS
PROTEIN STRUCTURE AND CONTACT MAP PREDICTION
Protein Structure Prediction
Protein Structure Prediction (PSP) aims to predict the 3D structure of a protein from its primary sequence
[Figure: primary sequence → 3D structure]
Why PSP?
PSP remains, after many years, one of the main challenges in computational biology
The function of a protein is determined by its structure
Thus, algorithms for predicting a protein’s structure will aid
  Understanding a protein’s function and characterising its binding sites
  Producing antibodies for immunolocalisation
  And, looking far beyond, designing new proteins (better crops, more efficient drugs, etc.)
PSP: A family of problems
There are several kinds of prediction problems within the scope of PSP
The main one is to predict the 3D coordinates of all atoms of a protein (or at least the backbone) based on its primary sequence
There are many structural properties of individual residues within a protein that can be predicted
  Secondary structure (SS), solvent accessibility (SA)
Accurate predictions of these sub-problems are a stepping stone towards the general 3D problem
TOPOLOGICAL PROPERTIES OF PROTEINS
Contact Map
Two residues of a chain are said to be in contact if their distance is less than a certain threshold
The contacts of a protein can be represented by a binary matrix: 1 = contact, 0 = non-contact
Plotting this matrix reveals many characteristics of the protein structure
CM prediction is used in many 3D PSP methods (e.g. I-TASSER)
[Figure: example contact map, with helices and sheets labelled]
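The definition above can be sketched in a few lines. This is a minimal illustration, not the talk's implementation: the 8 Å threshold, the use of one 3D point per residue, and the function name `contact_map` are all assumptions, since the slide only says "a certain threshold".

```python
import math

def contact_map(coords, threshold=8.0):
    """Return a symmetric binary matrix: 1 = contact, 0 = non-contact.

    coords: one (x, y, z) point per residue (assumed representation).
    threshold: distance cut-off in Angstroms (assumed value).
    """
    n = len(coords)
    cm = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) < threshold:
                cm[i][j] = cm[j][i] = 1
    return cm

# Toy chain: four points on a line, 5 Angstroms apart (hypothetical data).
toy = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (10.0, 0.0, 0.0), (15.0, 0.0, 0.0)]
for row in contact_map(toy):
    print(row)
```

In the toy chain only sequence neighbours fall under the threshold, so the matrix has ones next to the diagonal and zeros elsewhere.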
Recursive Convex Hull
Structural feature that we have proposed recently [Stout, Bacardit, Hirst & Krasnogor, Bioinformatics 2008 24(7):916-923]
We model a protein as a series of nested layers, assigning each residue to one of the layers
Strictly speaking, each layer is a convex hull of points
The convex hull of a point set is simple and fast to compute
Recursive Convex Hull (RCH) is computed by iteratively identifying the layers (hulls) of a protein
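The iterative peeling of hulls can be illustrated with a small sketch. To keep it self-contained it works in 2D and uses Andrew's monotone chain algorithm, which is a stand-in chosen here; the talk works with 3D residue coordinates, and a 3D hull routine would be needed for real proteins. The function names are hypothetical.

```python
def cross(o, a, b):
    """Z-component of the cross product (a - o) x (b - o)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Andrew's monotone chain in 2D; returns the hull vertices.

    Points lying on a hull edge (collinear) are not reported as vertices,
    so in this sketch they fall into the next layer.
    """
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def recursive_convex_hull(points):
    """Map each point to its layer: 1 = outermost hull, 2 = next hull, ..."""
    remaining = set(points)
    layer_of = {}
    layer = 1
    while remaining:
        hull = convex_hull(list(remaining))
        for p in hull:
            layer_of[p] = layer
        remaining -= set(hull)
        layer += 1
    return layer_of

# Two nested squares and a centre point (hypothetical data).
points = [(0, 0), (4, 0), (4, 4), (0, 4),
          (1, 1), (3, 1), (3, 3), (1, 3),
          (2, 2)]
layers = recursive_convex_hull(points)
print(layers[(0, 0)], layers[(1, 1)], layers[(2, 2)])
```

Each pass removes at least one point, so the loop always terminates; the layer number plays the role of the per-residue RCH value.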
Relation of RCH to other structural properties
Comparing
  Solvent Accessibility
  Exposure [Ben-Shimon and Eisenstein; 05]
  Residue depth [Chakravarty and Varadarajan; 99]
  RCH/RCHr
Correlation between features
Proximity Graphs (PGs)
DT ⊇ GG ⊇ RNG ⊇ MST
Poupon: 2004
Delaunay Tessellation (DT) of a point set
QHull: Barber, C.B., Dobkin, D.P., and Huhdanpaa, H.T., "The Quickhull algorithm for convex hulls," ACM Trans. on Mathematical Software, 22(4):469-483, Dec 1996
Proximity Graphs (PGs)
DT ⊇ GG ⊇ RNG ⊇ MST
Gabriel Graph (GG)
  Remove an edge from DT if the sphere whose diameter is that edge contains another vertex
Relative Neighbourhood Graph (RNG)
  Remove an edge from GG if its spherical lune (the intersection of the two spheres centred on the edge's vertices, with radius equal to the edge length) contains another vertex
Minimum Spanning Tree (MST)
  Search the RNG for the spanning tree of minimum total edge length
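The inclusion chain GG ⊇ RNG ⊇ MST can be checked on a small point set. The sketch below is an assumption-laden illustration: instead of pruning a Delaunay tessellation as the slide describes, it tests every pair of points directly against the Gabriel and lune conditions (a brute-force route that is only practical for small inputs), and it builds the MST with Kruskal's algorithm over all pairs. Function names are hypothetical.

```python
import math
from itertools import combinations

def gabriel_graph(pts):
    """Keep edge (i, j) if no other point lies strictly inside the sphere
    whose diameter is the segment pts[i]-pts[j]."""
    edges = set()
    for i, j in combinations(range(len(pts)), 2):
        d2 = math.dist(pts[i], pts[j]) ** 2
        if all(math.dist(pts[i], pts[k]) ** 2 + math.dist(pts[j], pts[k]) ** 2 >= d2
               for k in range(len(pts)) if k not in (i, j)):
            edges.add((i, j))
    return edges

def relative_neighbourhood_graph(pts):
    """Keep edge (i, j) if no other point is closer to both endpoints
    than the endpoints are to each other (empty lune)."""
    edges = set()
    for i, j in combinations(range(len(pts)), 2):
        d = math.dist(pts[i], pts[j])
        if all(max(math.dist(pts[i], pts[k]), math.dist(pts[j], pts[k])) >= d
               for k in range(len(pts)) if k not in (i, j)):
            edges.add((i, j))
    return edges

def minimum_spanning_tree(pts):
    """Kruskal's algorithm with a simple union-find."""
    parent = list(range(len(pts)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    edges = set()
    pairs = sorted(combinations(range(len(pts)), 2),
                   key=lambda e: math.dist(pts[e[0]], pts[e[1]]))
    for i, j in pairs:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            edges.add((i, j))
    return edges

# Hypothetical 3D points standing in for residue positions.
pts = [(0, 0, 0), (4, 0, 0), (1, 3, 0), (5, 4, 1), (2, 1, 3), (6, 1, 2)]
gg = gabriel_graph(pts)
rng = relative_neighbourhood_graph(pts)
mst = minimum_spanning_tree(pts)
print(len(gg), len(rng), len(mst))
print(mst <= rng <= gg)
```

The number of edges incident to a residue in each graph gives one possible packing-density measure per residue.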
Residue Packing Density
Protein 153L
[Figure: proximity graphs and contact map of protein 153L]
Public calculation server: http://lobelia.cs.nott.ac.uk/psp/newInterface/
Predictability of RCH
Using a variety of Machine Learning methods
Is RCH more predictable than other features?
[Figure: predictability of RCHr, RCH, RD, Exp and SA]
But is it useful?
Using these predictions to help predict Coordination Number (CN) better
RCH and SA are the most useful predictors
OUR CONTACT MAP PREDICTION METHOD
Steps
1. Prediction of
  Secondary structure (using PSIPRED)
  Solvent Accessibility
  Recursive Convex Hull
  Coordination Number
2. Integration of all these predictions plus other sources of information
3. Final CM prediction (using BioHEL)
Using BioHEL [Bacardit et al., 09]
Prediction of RCH, SA and CN
We selected a set of 2811 protein chains from PDB-REPRDB with:
  A resolution less than 2 Å
  Less than 30% sequence identity
  No chain breaks or non-standard residues
90% of this set was used for training (~490000 residues)
10% for test
How are these features predicted?
Many of these features are due to local interactions of an amino acid and its immediate neighbours
  We predict them from the closest neighbours in the chain
[Figure: a window of residues R(i-5) ... R(i+5) along the chain, each with its structural state SS(i-5) ... SS(i+5)]
Example instances (input window → state to predict):
  R(i-1) R(i) R(i+1) → SS(i)
  R(i) R(i+1) R(i+2) → SS(i+1)
  R(i+1) R(i+2) R(i+3) → SS(i+2)
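The window examples above translate directly into a sliding-window instance generator. This is a minimal sketch: the padding symbol "X" for positions beyond the chain ends and the function name `window_instances` are assumptions, not details given in the talk.

```python
def window_instances(residues, states, half_width=1, pad="X"):
    """Build one (window, state) instance per residue.

    residues: sequence of residue symbols R(0) ... R(n-1)
    states:   structural state of each residue, SS(0) ... SS(n-1)
    pad:      symbol used beyond the chain ends (assumed convention)
    """
    if len(residues) != len(states):
        raise ValueError("residues and states must have the same length")
    padded = [pad] * half_width + list(residues) + [pad] * half_width
    instances = []
    for i, state in enumerate(states):
        window = padded[i:i + 2 * half_width + 1]
        instances.append((window, state))
    return instances

# Hypothetical five-residue chain with made-up states.
for window, state in window_instances("MKVLA", "CHHHC"):
    print(window, "->", state)
```

With `half_width=1` each instance has the shape R(i-1) R(i) R(i+1) → SS(i) shown in the examples; a larger `half_width` widens the window symmetrically.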
Prediction of RCH, SA and CN
All three features were predicted based on a window of ±4 residues around the target
  Evolutionary information (as a Position-Specific Scoring Matrix) is the basis of this local information
  Each residue is characterised by a vector of 180 values
The domain for all three features was partitioned into 5 states
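A vector of 180 values is consistent with 9 window positions (±4 around the target) times 20 PSSM scores per position; that reading is an assumption, as is the zero padding at the chain ends. The sketch also shows one way to map a continuous value onto 5 states using four cut points. The talk does not say how the partition was chosen, so the cut points below are purely hypothetical.

```python
import bisect

def pssm_window(pssm, i, half_width=4, pad_value=0.0):
    """Flatten the PSSM rows of residues i-half_width .. i+half_width.

    pssm: list of rows, one per residue, each with one score per amino acid.
    Positions outside the chain are filled with pad_value (assumed choice).
    """
    n_cols = len(pssm[0])
    features = []
    for k in range(i - half_width, i + half_width + 1):
        if 0 <= k < len(pssm):
            features.extend(pssm[k])
        else:
            features.extend([pad_value] * n_cols)
    return features

def to_state(value, cut_points):
    """Map a continuous value to a state index 0 .. len(cut_points)."""
    return bisect.bisect_right(cut_points, value)

# Hypothetical PSSM: 12 residues x 20 amino-acid scores.
pssm = [[float(r)] * 20 for r in range(12)]
print(len(pssm_window(pssm, 5)))

# Four hypothetical cut points give five states (0 to 4).
cuts = [0.2, 0.4, 0.6, 0.8]
print([to_state(v, cuts) for v in (0.1, 0.35, 0.5, 0.7, 0.95)])
```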
Characterisation of the contact map problem
Three types of input information were used
1. Detailed information of three different windows of residues centred around
  The two target residues (2x)
  The middle point between them
2. Information about the connecting segment between the two target residues
3. Global protein information
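The three types of input can be pictured as one feature record per residue pair. The sketch below is only illustrative: it uses raw residue symbols where the real method uses predicted features and PSSM values, and the field names, window width and choice of segment and global statistics (separation, amino-acid frequencies, chain length) are assumptions.

```python
from collections import Counter

def window(seq, centre, half_width, pad="X"):
    """Residues centre-half_width .. centre+half_width, padded at chain ends."""
    return [seq[k] if 0 <= k < len(seq) else pad
            for k in range(centre - half_width, centre + half_width + 1)]

def aa_frequencies(seq):
    """Relative frequency of each amino acid in seq (empty dict if seq is empty)."""
    if not seq:
        return {}
    counts = Counter(seq)
    return {aa: n / len(seq) for aa, n in counts.items()}

def pair_features(seq, i, j, half_width=2):
    """Describe the candidate contact between residues i and j (i < j)."""
    if not 0 <= i < j < len(seq):
        raise ValueError("expected 0 <= i < j < len(seq)")
    mid = (i + j) // 2
    return {
        # 1. three windows: both target residues and the middle point
        "window_r1": window(seq, i, half_width),
        "window_r2": window(seq, j, half_width),
        "window_mid": window(seq, mid, half_width),
        # 2. connecting segment between the two target residues
        "separation": j - i,
        "segment_aa_freq": aa_frequencies(seq[i + 1:j]),
        # 3. global protein information
        "length": len(seq),
        "global_aa_freq": aa_frequencies(seq),
    }

features = pair_features("MKVLAEEGP", 1, 7)
print(features["window_r1"], features["separation"])
```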
Contact Map dataset
The set of 2811 proteins was randomly halved
Moreover, all proteins with more than 350 amino acids were discarded
Still, the resulting training set contained more than 15.2 million instances and 631 attributes
  Less than 2% of those are actual contacts
  36GB of disk space
Samples and ensembles
50 samples of 300K examples are generated from the training set with a ratio of 2:1 non-contacts/contacts
BioHEL is run 25 times for each sample
Prediction is done by a consensus of 1250 rule sets
Confidence of prediction is computed based on the distribution of votes in the ensemble
The whole training process takes about 289 CPU days (~5.5h/rule set)
[Figure: training set → samples (x50) → rule sets (x25) → consensus → predictions]
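The sampling and voting scheme can be sketched as follows. Several details are assumptions: sampling without replacement, a fixed random seed, and confidence defined as the fraction of rule sets voting "contact". Training itself (BioHEL) is not shown; the votes are supplied by hand.

```python
import random

def balanced_sample(instances, size, ratio=2, seed=0):
    """Draw `size` instances with `ratio` non-contacts per contact.

    instances: list of (features, label) pairs, label 1 = contact, 0 = non-contact.
    Sampling is without replacement (assumed).
    """
    rng = random.Random(seed)
    contacts = [x for x in instances if x[1] == 1]
    non_contacts = [x for x in instances if x[1] == 0]
    n_contacts = size // (ratio + 1)
    n_non_contacts = size - n_contacts
    sample = rng.sample(contacts, n_contacts) + rng.sample(non_contacts, n_non_contacts)
    rng.shuffle(sample)
    return sample

def contact_confidence(votes):
    """Fraction of rule sets that voted 'contact' (1) for one residue pair."""
    return sum(votes) / len(votes)

# Hypothetical, heavily imbalanced data: 30 contacts among 1030 instances.
data = [((k,), 1) for k in range(30)] + [((k,), 0) for k in range(1000)]
sample = balanced_sample(data, size=30)
print(sum(label for _, label in sample), len(sample))

# Votes of a toy ensemble of four rule sets for one residue pair.
print(contact_confidence([1, 1, 0, 1]))
```

With the figures from the slide, 50 samples times 25 runs gives the 1250 rule sets, and 1250 × 5.5 h ÷ 24 is roughly 286 CPU days, in line with the quoted total.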
CONTACT MAP PREDICTION AT CASP8
Contact Map prediction in CASP
Contact Map is assessed using the 11 CASP targets in the Free Modelling category
Also, only long-range contacts (with a minimum chain separation of 24 residues) are evaluated
Predictor groups are asked to submit a list of predicted contacts and a confidence level for each prediction
The assessors then rank the predictions for each protein and take a look at the top L/x ones, where L is the length of the protein and x={5,10}
Contact Map prediction in CASP
From these L/x top ranked contacts two measures are computed
  Accuracy: TP/(TP+FP)
  Xd: difference between the distribution of predicted distances and a random distribution
22 groups participated in CASP8, but not all of them sent enough predictions for L/10 or L/5
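The ranking and accuracy steps can be sketched as below. This is a simplified reading of the assessment, not the assessors' code: rounding L/x down to an integer and the data layout are assumptions, and Xd is not implemented because its exact formula is not given in the talk.

```python
def top_long_range(predictions, length, x, min_separation=24):
    """Return the L/x most confident long-range predictions.

    predictions: list of (i, j, confidence) tuples.
    length:      protein length L; L/x is rounded down (assumed).
    """
    long_range = [p for p in predictions if abs(p[0] - p[1]) >= min_separation]
    long_range.sort(key=lambda p: p[2], reverse=True)
    return long_range[:length // x]

def accuracy(selected, true_contacts):
    """TP / (TP + FP); true_contacts is a set of (i, j) pairs with i < j."""
    if not selected:
        return 0.0
    tp = sum((min(i, j), max(i, j)) in true_contacts for i, j, _ in selected)
    return tp / len(selected)

# Hypothetical predictions for a protein of length 50.
preds = [(0, 30, 0.9), (1, 40, 0.8), (2, 5, 0.99), (3, 45, 0.7),
         (4, 29, 0.6), (5, 49, 0.5), (6, 48, 0.4)]
true_contacts = {(0, 30), (3, 45), (5, 49)}

top = top_long_range(preds, length=50, x=10)
print(len(top), accuracy(top, true_contacts))
```

In the toy data the pair (2, 5) has the highest confidence but is discarded because its chain separation is below 24.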
Accuracy Results
Accuracy for groups that predicted a common subset of targets
Ezkurdia et al. Proteins 2009; 77(Suppl 9):196-209
Xd results
Ezkurdia et al. Proteins 2009; 77(Suppl 9):196-209
L/10 prediction for target T0443-D1
67% accuracy
Ezkurdia et al. Proteins 2009; 77(Suppl 9):196-209
WHAT INSIGHT CAN WE EXTRACT FROM THE METHOD?
Is all that information useful?
Many different types of information were used to perform the prediction
  Is all of it relevant?
As BioHEL generates human-readable sets of rules, we can address this question
Rule generated by BioHEL
Att PredSS_r1_1 is E,X and
Att PredRCH_r1 is 4 and
Att PredCN_r1_-1 is 0,2,3,4,X and
Att PredCN_r2_1 is 3,4 and
Att AA_freq_central_P=0 and
Att AA_freq_global_E is [0.02,0.10] and
Att PSSM_r2_-1_Y is [-7,9.69] and
Att PSSM_r2_0_I is [1.76,8]
then contact
8 attributes in this rule out of 631 (on average 8.3 att/rule)
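To make the rule's semantics concrete, the sketch below encodes it as a dictionary and tests it against one instance. The encoding (sets for discrete attributes, closed intervals for real-valued ones) and the function `matches` are hypothetical; they are not BioHEL's internal representation. Attributes not mentioned in the rule are simply ignored, which is why a rule can use 8 of the 631 attributes.

```python
# The rule from the slide, in a hypothetical dictionary encoding.
RULE = {
    "PredSS_r1_1": {"E", "X"},
    "PredRCH_r1": {"4"},
    "PredCN_r1_-1": {"0", "2", "3", "4", "X"},
    "PredCN_r2_1": {"3", "4"},
    "AA_freq_central_P": (0.0, 0.0),
    "AA_freq_global_E": (0.02, 0.10),
    "PSSM_r2_-1_Y": (-7.0, 9.69),
    "PSSM_r2_0_I": (1.76, 8.0),
}

def matches(rule, instance):
    """True if the instance satisfies every predicate of the rule."""
    for attribute, condition in rule.items():
        value = instance[attribute]
        if isinstance(condition, (set, frozenset)):
            if value not in condition:
                return False
        else:
            low, high = condition
            if not low <= value <= high:
                return False
    return True

# A made-up instance; only the attributes tested by the rule are listed.
instance = {
    "PredSS_r1_1": "E",
    "PredRCH_r1": "4",
    "PredCN_r1_-1": "2",
    "PredCN_r2_1": "3",
    "AA_freq_central_P": 0.0,
    "AA_freq_global_E": 0.05,
    "PSSM_r2_-1_Y": 1.0,
    "PSSM_r2_0_I": 3.0,
}
print("contact" if matches(RULE, instance) else "no match")
```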
Understanding the rule sets
Each rule set has on average 135 rules
We have a total of 168470 rules
Impossible to read all of them individually, but we can extract useful statistics
  For instance, how often was each attribute used in the rules?
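Counting attribute usage across all rule sets is a short exercise. In this sketch a rule is reduced to the collection of attribute names it tests, and "frequency" is taken as an attribute's share of all attribute occurrences; both are assumptions about how such statistics might be computed, not a description of the authors' scripts.

```python
from collections import Counter

def attribute_usage(rule_sets):
    """Return (attribute, count, share of all attribute occurrences),
    sorted from most to least used."""
    counts = Counter(att for rules in rule_sets for rule in rules for att in rule)
    total = sum(counts.values())
    return [(att, n, n / total) for att, n in counts.most_common()]

# Two toy rule sets; each rule is the set of attributes it tests.
rule_sets = [
    [{"separation", "PredSA_r1"}, {"separation", "PredRCH_r1", "PredSS_r2"}],
    [{"separation"}, {"PredSA_r1", "propensity"}],
]
for att, n, share in attribute_usage(rule_sets):
    print(f"{att:12s} {n:3d} {share:6.2%}")
```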
Distribution of frequency of use of attributes
All 631 attributes are actually used (min frequency=429)
However, some of them are used much more frequently than others
Top 10 attributes
Attribute     Frequency  Counts
PredSS_r1_1   1.48%      18141
PredCN_r1     1.66%      20336
propensity    1.74%      21288
PredSS_r2     1.75%      21350
PredSS_r1     1.82%      22205
PredRCH_r2    1.87%      22856
PredRCH_r1    2.04%      24961
PredSA_r2     2.12%      25891
PredSA_r1     2.39%      29246
separation    4.17%      50951
All four kinds of residue predictions are highly ranked
Beyond individual attributes…
We can also identify when certain pairs (or triplets) of attributes always appear together in rules
  Rules for alpha helices or beta sheets
And we can look not just at the attributes, but also at the actual patterns of predicates
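Finding pairs of attributes that always appear together can be sketched with simple counting. The criterion used here (a pair qualifies when its joint count equals the individual count of both attributes, and is at least `min_support`) is one possible reading of "always appear together"; the function name and threshold are hypothetical.

```python
from collections import Counter
from itertools import combinations

def always_together(rules, min_support=2):
    """Pairs of attributes that never occur in a rule without each other.

    rules: iterable of rules, each an iterable of attribute names.
    """
    single = Counter()
    pair = Counter()
    for rule in rules:
        attributes = sorted(set(rule))
        single.update(attributes)
        pair.update(combinations(attributes, 2))
    return sorted((a, b) for (a, b), n in pair.items()
                  if n >= min_support and n == single[a] == single[b])

# Toy rules over made-up attribute names.
rules = [{"A", "B", "C"}, {"A", "B"}, {"C", "D"}]
print(always_together(rules))
```

The same counting extends to triplets by replacing `combinations(attributes, 2)` with `combinations(attributes, 3)` and checking all three individual counts.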
Conclusions
Our method was one of the top performing CM predictors in CASP8
  Combination of novel topological features (RCH) and a robust data mining method
Our BioHEL rule-based data mining method is able to
  Generate competent predictions
  Extract explanations from the predictions
Still a lot of room for improvement
  Better ranking of predictions
  Alternative formulation of sub-predictions
  Correlated mutations
CM prediction. Is it worth it?
CM predictors (blue) vs contacts derived from 3D PSP methods (orange)
In CASP8, for the first time, the CM methods were competitive
Acknowledgements
Many thanks to the members of our Infobiotics team in CASP8
  Prof. Natalio Krasnogor
  Prof. Jonathan Hirst
  Dr. Michael Stout
The UK Engineering and Physical Sciences Research Council (EPSRC) under grant GR/T07534/01
The University of Nottingham’s High Performance Computing cluster