Introduction
Applications of secondary structure prediction:
• Protein folding research
• Fold recognition
• Protein classification
• Homology modeling
• Ab initio methods
Evolution of Prediction Accuracy
1978  Chou & Fasman     57%
1989  Qian & Sejnowski  66%
1993  Rost & Sander     71%
Present – up to 78%?
Accuracy of current secondary structure prediction methods
Method      Q3    CorrH  CorrE  CorrC
PROF        77.0  0.67   0.65   0.56
PSIPRED     76.6  0.66   0.64   0.56
SSpro       76.3  0.67   0.64   0.56
JPred2      75.2  0.65   0.63   0.54
PHDpsi      75.1  0.64   0.62   0.53
PHD         71.9  0.59   0.59   0.59
Copenhagen  78
Neural Networks
• Neural networks are trained rather than programmed to carry out chosen information processing tasks
• Training a neural network means adjusting it so that it produces the desired output for each pattern in a given set of input patterns
• Since the desired outputs are known in advance, training a feed-forward network is a form of supervised learning
• Back-propagation algorithm – each recognition error at the output triggers a backward pass that corrects the network weights feeding the activation functions
Computing Output
• The output of each node is a weighted sum of its inputs
• A squashing function (monotonic and differentiable) is then applied to map the result into the range 0 to 1
a_{p,j} = Σ_i w_{i,j} · y_{p,i}

f(x) = 1 / (1 + e^{−x})

y_{p,i} = f(a_{p,i})
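The node computation above can be sketched in a few lines: a weighted sum of the inputs followed by the logistic squashing function. This is an illustrative sketch; the function and variable names are not from the original SNNS implementation.

```python
import math

def squash(x):
    """Logistic squashing function f(x) = 1 / (1 + e^-x), output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def node_output(inputs, weights):
    """Node output y = f(a), where a is the weighted sum of the inputs."""
    a = sum(w * y for w, y in zip(weights, inputs))
    return squash(a)

# Two inputs, two weights: a = 1.0*0.5 + 0.0*(-0.3) = 0.5, y = f(0.5) ≈ 0.622
print(node_output([1.0, 0.0], [0.5, -0.3]))
```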
Training a feed-forward net
• Training was performed using the SNNS (Stuttgart Neural Network Simulator) package
• Network architecture and weights were exported using the snns2c program from the SNNS package
• Custom Perl programs were used to prepare the data and benchmark the network
Training a feed-forward net
• Feed-forward nets are trained on a set of patterns known as the training set; the desired outputs are known in advance
• Every pattern has the same number of elements as there are input nodes
• The goal of on-line training is to find the number of epochs that is optimal for generalisation, i.e. the net is well trained on the training set and also recognizes the test set well
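The epoch selection described above can be sketched as a simple loop: train one epoch at a time, score the test set, and keep the epoch with the best score. The `train_one_epoch` and `evaluate` callables are hypothetical placeholders, not part of the original SNNS workflow.

```python
def best_epoch(train_one_epoch, evaluate, max_epochs):
    """Return the epoch (and its score) with the best test-set performance."""
    best, best_score = 0, float("-inf")
    for epoch in range(1, max_epochs + 1):
        train_one_epoch()          # one pass over the training set
        score = evaluate()         # e.g. Q3 on the test set
        if score > best_score:
            best, best_score = epoch, score
    return best, best_score
```

In practice the test-set score rises, peaks, and then falls as the net overfits the training set; the peak marks the optimal number of epochs.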
Network architecture
• Number of input nodes (window size) – 11 aa; the predicted state is for the window centre:

  |VLSPADKTNVK|AAWGKVGAHAGEYGAEALERMFLSFP   → predicted state: C
  |00000VLSPAD|KTNVKAAWGKVGAHAGEYGAEALERM   → predicted state: C

• 11 positions × 20 amino acids = 220 input nodes
• Example profile row for one window position:
  -101 -186 -387 -302 -165 -403 -387 341 -302 36 15 -387 -302 -302 -387 -286 -101 545 -387 -186
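A minimal sketch of the sliding-window encoding above, using one-hot vectors in place of PSI-BLAST profile scores for simplicity: an 11-residue window over 20 amino acids gives 11 × 20 = 220 inputs, with all-zero vectors where the window extends past the chain ends (the `0` padding shown above). Names are illustrative.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
WINDOW = 11

def encode_window(seq, center):
    """Return the 220-element input vector for the window centred on `center`."""
    half = WINDOW // 2
    vector = []
    for pos in range(center - half, center + half + 1):
        one_hot = [0.0] * 20
        if 0 <= pos < len(seq):    # positions outside the chain stay all-zero
            one_hot[AMINO_ACIDS.index(seq[pos])] = 1.0
        vector.extend(one_hot)
    return vector

# Window centred on the first residue, i.e. |00000VLSPAD| above:
v = encode_window("VLSPADKTNVKAAWGKV", 0)
print(len(v))  # 220
```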
Network Architecture
Input  Hidden  Output  Q3
300    20      3       71.277
260    20      3       72.742
220    20      3       73.428
180    20      3       70.083
Q3 – percentage of correctly predicted residues; Corr – per-state correlation coefficient
MSE – Mean Square Error:

MSE = (1/n) Σ_{t=1}^{n} Σ_{i=1}^{k} ( f(x_i(t)) − p(x_i(t)) )²

where n is the number of training patterns, k the number of output nodes, f the network output and p the desired (target) output.

[Plot: MSE vs. epoch for input window sizes 180, 220, 260 and 300]
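The MSE formula above, sketched directly: the squared differences between network outputs and targets, summed over all output nodes and averaged over patterns. Function and variable names are illustrative.

```python
def mean_square_error(outputs, targets):
    """outputs, targets: lists of patterns, each a list of k output values."""
    n = len(outputs)
    total = 0.0
    for f_t, p_t in zip(outputs, targets):
        total += sum((f_i - p_i) ** 2 for f_i, p_i in zip(f_t, p_t))
    return total / n

# One pattern, three output nodes (H, E, C):
# (0.2-0)^2 + (0.1-0)^2 + (0.9-1)^2 = 0.04 + 0.01 + 0.01, so MSE ≈ 0.06
print(mean_square_error([[0.2, 0.1, 0.9]], [[0.0, 0.0, 1.0]]))
```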
Network Architecture
Input  Hidden  Output  Q3
220    40      3       71.567
220    30      3       71.466
220    20      3       73.428
220    10      3       70.345

[Plot: MSE vs. epoch for hidden layer sizes 10, 20, 30 and 40]
Preparing data
• Two sets of 118 non-homologous proteins
• Each sequence was used as a query for PSI-BLAST (E-value cutoff 0.001, nr database)
• The resulting profile was normalized using the squashing function
• Secondary structure information for training comes from the PDB database
• DSSP classes were collapsed to three target classes:
  H, G – helix
  E – strand
  B, I, S, T, e, g, h – coil
• Two sets – training and test – were built; about 20000 patterns for the 220/20/3 network architecture
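The three-state collapse of DSSP classes above can be written as a simple lookup table. This is a sketch using the mapping exactly as given on the slide (including the lowercase letters); other published collapse schemes assign some of these classes differently.

```python
DSSP_TO_3STATE = {
    "H": "H", "G": "H",                            # helix
    "E": "E",                                      # strand
    "B": "C", "I": "C", "S": "C", "T": "C",        # coil
    "e": "C", "g": "C", "h": "C",
}

def collapse(dssp_string):
    """Map a DSSP assignment string to three-state H/E/C; anything unlisted
    (e.g. the blank used for unassigned residues) becomes coil."""
    return "".join(DSSP_TO_3STATE.get(ch, "C") for ch in dssp_string)

print(collapse("HHHGG  EEETTS"))  # HHHHHCCEEECCC
```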
Results
• A C program, using a PSI-BLAST or fake profile, generates the secondary structure prediction in three-letter code or in PSIPRED format (0.234 0.135 0.679)
• PHP-based web service (not benchmarked yet)
• Percentage of correctly predicted residues: Q3 = 73.428
Strategies to increase accuracy
• Adding new types of biological information
• Change the way that information is presented to the network
• Post process the network predictions
• Change the network architecture
Strategies to increase accuracy
Biological Information:
• Hydrophobicity, charge, backbone properties
• Length of chain – additional input
• Distance to the N- and C-terminal amino acids
• Non-local information (all-alpha, all-beta, etc.)
Strategies to increase accuracy
Pattern level information:
• Increase the number of sequences
• Balance the data (e.g. an equal number of helices and strands)
• Window information from other parts of the amino acid sequence
Strategies to increase accuracy
Post processing and filtering:
• Second-level network: structure-to-structure – in development
• Simple voting among multiple networks with different parameters – in development
Strategies to increase accuracy
Changing the architecture:
• Number of hidden nodes – done
• Input window size – done
• Other learning algorithms