1
A Four-Body Statistical Potential for Protein Fold Recognition
Bala Krishnamoorthy and Alex Tropsha
UNC Chapel Hill
Nov 17, 2003
2
Four-Body Potentials
Outline
Four-body statistical potentials
Application to folding simulations
Application to predictions from CASP5 and Livebench 6
Hypothesis
Motivation
3
Motivation
Knowledge of protein structure is essential to understanding protein function
The number of proteins with known sequences is growing exponentially
Traditional methods for determining protein structure (X-ray crystallography, NMR, etc.) do not yield quick results
Need to develop statistical methods that help with protein fold recognition
4
Hypothesis
Specific nearest neighbor residue contacts in protein structures have non-random propensities for occurrence.
The propensities of occurrence of nearest neighbor clusters can be used to score compatibility between protein sequence and structure
5
SNAPP
Simplicial Neighborhood Analysis of Protein Packing
2-D packing: 3 neighbors in mutual contact
3-D packing: 4-neighbor clusters
6
An objective definition of the nearest neighborhood of each residue is needed
Use the Voronoi diagram of the protein
- gives a convex polyhedral cell around each residue (represented as a point) that defines the nearest neighborhood of the residue
Delaunay triangulation – defined as the dual of the Voronoi diagram
7
Tessellation of protein structure (in 3D)
Residues are represented by their side-chain centers (or by their C-α atoms)
Protein structure represented as an aggregate of space filling, non-intersecting and irregular tetrahedra
Nearest neighbor residues are identified as unique sets of four residues each (tetrahedral quadruplets)
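The tessellation step can be illustrated with a minimal sketch. It assumes residues are already reduced to 3D points and uses a brute-force empty-circumsphere test, which is only practical for toy inputs; a library routine such as SciPy's `scipy.spatial.Delaunay` would be the usual choice for real proteins. The function names and the five toy coordinates are illustrative assumptions, not taken from the slides.

```python
from itertools import combinations

def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def circumsphere(a, b, c, d):
    """Return (center, squared radius) of the sphere through four points,
    or None if the points are (nearly) coplanar."""
    # |x - a|^2 = |x - p|^2 for p in (b, c, d) gives 2 (p - a) . x = |p|^2 - |a|^2
    rows = [[2 * (p[i] - a[i]) for i in range(3)] for p in (b, c, d)]
    rhs = [sum(p[i] ** 2 - a[i] ** 2 for i in range(3)) for p in (b, c, d)]
    det = det3(rows)
    if abs(det) < 1e-9:
        return None
    center = []
    for col in range(3):  # Cramer's rule, one coordinate at a time
        m = [row[:] for row in rows]
        for r in range(3):
            m[r][col] = rhs[r]
        center.append(det3(m) / det)
    r2 = sum((center[i] - a[i]) ** 2 for i in range(3))
    return center, r2

def delaunay_quadruplets(points):
    """Brute force: keep every 4-point set whose circumsphere contains no other point."""
    quads = []
    for quad in combinations(range(len(points)), 4):
        sphere = circumsphere(*(points[i] for i in quad))
        if sphere is None:
            continue
        center, r2 = sphere
        empty = all(
            sum((points[j][i] - center[i]) ** 2 for i in range(3)) >= r2 - 1e-9
            for j in range(len(points)) if j not in quad)
        if empty:
            quads.append(quad)
    return quads

# Toy "residue" coordinates (hypothetical, not a real protein)
points = [(0, 0, 0), (2, 0, 0), (0, 2, 0), (0, 0, 2), (1.5, 1.5, 1.5)]
quads = delaunay_quadruplets(points)
print(quads)
```

For these five points the sketch yields three tetrahedra that share the edge between points 0 and 4; each tuple is one tetrahedral quadruplet of residue indices.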
8
Four-body Statistical Potentials
Denote each quadruplet by { i , j , k , l }
i,j,k and l can be any of the 20 amino acids
Total number of possible quadruplet compositions (order of residues ignored) is 8855
AALVVALITLKMYYYY …
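The count of 8855 follows if a quadruplet is treated as an unordered selection of four amino acids with repetition allowed, i.e. C(20 + 4 - 1, 4). A short check, assuming the standard one-letter codes for the 20 amino acids:

```python
from itertools import combinations_with_replacement
from math import comb

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard one-letter codes

# Every unordered quadruplet composition, e.g. ('A', 'A', 'L', 'V')
quadruplet_types = list(combinations_with_replacement(AMINO_ACIDS, 4))
n_types = len(quadruplet_types)

# Multiset count: C(n + k - 1, k) with n = 20 residue types, k = 4 vertices
print(n_types, comb(20 + 4 - 1, 4))
```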
9
Based on the backbone connectivity of {i,j,k,l}, there can be five types of tetrahedra (indexed 0, 1, 2, 3 and 4)
The propensities of the {i,j,k,l} quadruplets of each type t could be used to develop four-body statistical potentials
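The slide text does not say which connectivity pattern maps to which index. The sketch below assumes one plausible convention based on runs of consecutive sequence positions: type 0 = no two residues adjacent in sequence, type 1 = one adjacent pair, type 2 = two separate adjacent pairs, type 3 = three consecutive residues, type 4 = four consecutive residues. Both the mapping and the function name are assumptions, and chain breaks are not handled.

```python
def tetrahedron_type(positions):
    """Classify a tetrahedron by runs of consecutive sequence positions
    (assumed convention, see the lead-in)."""
    p = sorted(positions)
    # Lengths of runs of consecutive positions, e.g. [5, 6, 9, 10] -> [2, 2]
    runs = [1]
    for prev, cur in zip(p, p[1:]):
        if cur == prev + 1:
            runs[-1] += 1
        else:
            runs.append(1)
    pattern = tuple(sorted(runs, reverse=True))
    return {
        (1, 1, 1, 1): 0,  # no consecutive residues
        (2, 1, 1): 1,     # one consecutive pair
        (2, 2): 2,        # two separate consecutive pairs
        (3, 1): 3,        # three consecutive residues
        (4,): 4,          # four consecutive residues
    }[pattern]

for example in ([3, 10, 20, 31], [3, 4, 20, 31], [3, 4, 20, 21], [3, 4, 5, 31], [3, 4, 5, 6]):
    print(example, tetrahedron_type(example))
```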
10
Four-body compositional propensities of Delaunay simplices
$q_{ijkl,t} = \log \left( f_{ijkl,t} / p_{ijkl,t} \right)$
$f_{ijkl,t}$ - observed frequency of occurrence in the training set of quad {ijkl} in a type t tetrahedron
$p_{ijkl,t}$ - expected frequency of occurrence in the training set of residues i, j, k and l in a type t tetrahedron
$p_{ijkl,t} = C \, a_i a_j a_k a_l \, p_t$
$a_i$ - individual AA frequency
$p_t$ - frequency of type t tetrahedra
$C$ - combinatorial factor
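A minimal sketch of the log-likelihood for a single quadruplet, q = log(f / p) with p = C a_i a_j a_k a_l p_t. Two points are assumptions rather than statements from the slides: the combinatorial factor is taken to be the multinomial coefficient 4! / (n_1! n_2! ...), where n_m is the multiplicity of each distinct residue type in the quadruplet, and the logarithm is taken base 10. The function names and the uniform toy frequencies are illustrative.

```python
from collections import Counter
from math import factorial, log10

def combinatorial_factor(quad):
    """Assumed form: C = 4! / prod(n_m!), n_m = multiplicity of each residue type."""
    c = factorial(4)
    for n in Counter(quad).values():
        c //= factorial(n)
    return c

def log_likelihood(quad, f_obs, aa_freq, p_type):
    """q = log10(f / p), with p = C * a_i * a_j * a_k * a_l * p_t."""
    expected = combinatorial_factor(quad) * p_type
    for aa in quad:
        expected *= aa_freq[aa]
    return log10(f_obs / expected)

# Toy numbers (hypothetical): uniform residue frequencies, half of all tetrahedra of type t
aa_freq = {aa: 0.05 for aa in "ACDEFGHIKLMNPQRSTVWY"}
q = log_likelihood("ACDE", f_obs=7.5e-4, aa_freq=aa_freq, p_type=0.5)
print(round(q, 6))
```

With these toy inputs the expected frequency is 24 x 0.5 x 0.05^4 = 7.5e-5, so a quadruplet observed ten times more often than expected scores about 1.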
11
Potentials are derived from a diverse training set of 1166 protein chains with known structure
For a test conformation, the total log-likelihood score is calculated by adding the score for each tetrahedron in its Delaunay tessellation.
Higher Score ↔ better structure
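The scoring rule above amounts to a table lookup and a sum. In this sketch the score table `q_table`, its key layout (sorted residue letters plus tetrahedron type), and the choice of 0.0 for quadruplets absent from the table are all assumptions made for illustration; the numbers are toy values.

```python
def total_score(tetrahedra, q_table, default=0.0):
    """Sum per-tetrahedron scores; each tetrahedron is (residue letters, type)."""
    total = 0.0
    for residues, t in tetrahedra:
        # Composition is order-independent, so sort the letters to build the key
        key = ("".join(sorted(residues)), t)
        total += q_table.get(key, default)
    return total

# Hypothetical score table and tessellation output
q_table = {("AALV", 1): 0.4, ("ILVV", 0): 0.7}
tetrahedra = [("LAVA", 1), ("VILV", 0), ("WWWW", 4)]
score = total_score(tetrahedra, q_table)
print(score)
```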
12
MD simulation of proteins
Comparison of pre- and post-TS (transition state) structures of CI2 vs. native CI2*
[Figure: pre-TS (six structures), post-TS (20 structures), and native structure]
Go potentials (native structure specific) fail to discriminate among the three!
*Structures courtesy of Dr. E. Shakhnovich, Harvard (Ref: J. Mol. Biol. 296 (2000) p1183-1188)
13
[Figure: total score (y-axis, 20 to 120) for each of the 27 instances (x-axis): red = pre-TS (6), yellow = post-TS (20), green = native]
N.B. - The 5th pre-TS instance actually had a 0.10 probability of folding (the other five pre-TS structures had ~ 0 probability of folding)
Comparison of total scores for pre- and post-TS structures of CI2 vs. native CI2
14
[Figure: per-residue profile of log-likelihood score (y-axis, 0 to 20) vs. residue # (x-axis, 0 to 64) for the pre-TS, post-TS, and native structures; labeled residues: L8, V13, A16, I20, I29, V31, V47, L49, I57]
[Figure: ProCAM of post-TS structure; labeled residues: V13, V31, L49, V51]
Structure profiles of pre-TS vs. post-TS structure of CI2
15
[Figure panels: pre-TS and post-TS]
SNAPP analysis of pre-TS vs. post-TS structure of CI2
16
[Figure: per-residue profile of log-likelihood score (y-axis, 0 to 18) vs. residue # (x-axis, 0 to 56) for the pre-TS, post-TS, and native structures; labeled residues: Y8, L16, F18, W35, A37, G46, I48, Y52]
Structure profiles of pre-TS vs. post-TS structure of SH3
17
Scoring Livebench 6 and CASP5 predictions
Livebench: automated evaluation of structure prediction servers
Set 6 had 32 “easy” and 66 “hard” targets
CASP5
3D coordinate models submitted for 56 targets
Native structures of 33 targets have been released
- rank 3D predictions using four-body potentials
- compare with the ranking using global structural similarity measures (like MaxSub)
18
To compare rankings, use the predictive index (PI)
Here, E – experimental values, P – predicted values
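The PI formula itself did not survive in the slide text. The sketch below assumes the weighted pairwise form usually attributed to Pearlman and Charifson: each pair of models contributes +1 if E and P order the pair the same way, -1 if they order it oppositely, and 0 if P ties, weighted by |E_j - E_i|. Under that assumption PI ranges from -1 (always wrong) to +1 (perfect ranking). The function name and toy values are illustrative.

```python
def predictive_index(experimental, predicted):
    """Weighted pairwise rank agreement between E and P (assumed PI definition)."""
    num = 0.0
    den = 0.0
    n = len(experimental)
    for i in range(n):
        for j in range(i + 1, n):
            de = experimental[j] - experimental[i]
            dp = predicted[j] - predicted[i]
            w = abs(de)  # pairs that differ more in E count more
            if de == 0 or dp == 0:
                c = 0    # tie: no credit either way
            elif (de > 0) == (dp > 0):
                c = 1    # pair ranked in the same order
            else:
                c = -1   # pair ranked in the opposite order
            num += w * c
            den += w
    return num / den if den else 0.0

# Toy values (hypothetical): one of three pairs is mis-ordered
pi = predictive_index([1.0, 2.0, 4.0], [1.0, 3.0, 2.0])
print(round(pi, 3))
```

In the toy call the correctly ordered pairs carry weights 1 and 3 and the mis-ordered pair carries weight 2, giving (1 + 3 - 2) / 6.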
19
Livebench 6
10 models for each target made by PMODELLER
PI for 28 “easy” targets and 38 “hard” targets (at least one model had a non-zero MaxSub score)

Easy      <PI>   Std(PI)
4B pot    0.83   0.20
MJ        0.70   0.39
PMOD      0.80   0.19

Hard      <PI>   Std(PI)
4B pot    0.83   0.11
MJ        0.74   0.18
PMOD      0.84   0.15
20
CASP5
For 18 targets (out of 33), the native structure ranked better than all predictions
For 26 (out of 33) targets, the native structure was ranked within the top 3.5% of all the predictions

CASP5     <PI>   Std(PI)
4B pot    0.61   0.18
MJ        0.39   0.20
CRMSD     0.63   0.22
21
Conclusions
A four-body statistical scoring function is developed based on the Delaunay tessellation of proteins
Discriminates native from decoy structures in most cases
Distinguishes pre- and post-transition state structures and the native structure from MD folding simulation trajectories
Ranks Livebench 6 and CASP5 predictions accurately (mean PI of 0.83 and 0.61, respectively)